New Algebraic Formulation of Density Functional 

Calculation 

Sohrab Ismail-Beigi^ and T.A. Arias* 

f Department of Physics, Massachusetts Institute of Technology, 
Cambridge, Massachusetts 02139 

J Laboratory of Atomic and Solid State Physics, Cornell University, 

Ithaca, New York 14853 
and 

J Research Laboratory of Electronics, Massachusetts Institute of Technology, 

Cambridge, Massachusetts 02139 

Suggested PACS codes: 71.15.-m, 71.15.Ap, 71.15.Fv, 71.15.Hx. 

Keywords: density-functional theory, ab initio calculations, electronic 
structure methods, electronic structure calculations, high 
performance computing, parallel computing, computer languages 

In press: Comp. Phys. Comm., vol. 128, pp. 1-45 (June 2000). 

Abstract 

This article addresses a fundamental problem faced by the community employing 
single-particle ab initio methods: the lack of an effective formalism for the rapid ex- 
ploration and exchange of new methods. To rectify this, we introduce a new, basis-set 
independent, matrix-based formulation of generalized density functional theories which 
reduces the development, implementation, and dissemination of new techniques to the 
derivation and transcription of a few lines of algebra. This new framework enables us 
to concisely demystify the inner workings of fully functional, highly efficient modern 
ab initio codes and to give complete instructions for their construction for calculations 
employing arbitrary basis sets. Within this framework, we also discuss in full de- 
tail a variety of leading-edge techniques, minimization algorithms, and highly efficient 
computational kernels for use with scalar as well as shared and distributed-memory 
supercomputer architectures. 
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1 Introduction 

This work gives a self-contained description of how to build a highly flexible, portable density- 
functional production code which attains significant fractions of peak performance on scalar 
cached architectures, shared-memory processors (SMP), and distributed- memory processors 
(DMP). More importantly, however, this work introduces a new formalism, DFT++, for 
the development, implementation, and dissemination of new ab initio generalized functional 
theoretic techniques among researchers. The most well-known and widely used generalized 
functional theory (GFT) is density-functional theory, where the energy of the system is 
parametrized as a functional of the electron density. Although the formalism presented here 
is applicable to other single-particle GFTs, such as self-interaction correction or Hartree-Fock 
theory, for concreteness we concentrate primarily on density functional theory (DFT). 
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This formalism is of particular interest to those on the forefront of exploring new ab 
initio techniques and novel applications of such in the physical sciences. It allows practition- 
ers to quickly introduce new physics and techniques without expenditure of significant effort 
in debugging and optimizing or in developing entirely new software packages. It does so by 
providing a new, compact, and explicit matrix-based language for expressing GFT calcula- 
tions, which allows new codes to be "derived" through straightforward formal manipulations. 
It also provides a high degree of modularity, a great aid in maintaining high computational 
performance. 

This language may be thought of as being for GFT what the Dirac notation is for 
quantum mechanics: a fully explicit notation free of burdensome details which permits the 
ready performance of complex manipulations with focus on physical content. Direct appli- 
cation of the Dirac notation to GFT is particularly cumbersome because in single-particle 
theories, the quantum state of the system is not represented by a single ket but rather a 
collection of kets, necessitating a great deal of indexing. Previous attempts to work with the 
Dirac notation while eliminating this indexing have included construction of column vectors 
whose entries were kets [[TJ but such constructions have proved awkward because, ultimately, 
kets are members of an abstract Hilbert space and are not the fundamental objects of an 
actual calculation. 

The foundation of the new DFT++ formalism is the observation that all the nec- 
essary computations in an ab initio calculation can be expressed explicitly as standard 
linear-algebraic operations upon the actual computational representation of the quantum 
state without reference to complicated indexing or to the underlying basis set. With tradi- 
tional approaches, differentiating the energy functional, which is required for self-consistent 
solution for the single-particle orbitals, is a frequent source of difficulty. Issues arise such 
as the distinction between wave functions and their duals, covariant versus contravariant 
quantities, establishing a consistent set of normalization conventions, and translation from 
continuum functional derivatives to their discrete computational representations. However, 
by expressing the energy explicitly in our formalism, all these difficulties are automatically 
avoided by straightforward differentiation of a well-defined linear-algebraic expression. 

This new formalism allows not only for ease of formal manipulations but also for 
direct transcription of the resulting expressions into software, i.e. literal typing of physical 
expressions in their matrix form into lines of computer code. Literal transcription of oper- 
ations such as matrix addition and multiplication is possible through the use of any of the 
modern, high-level computer languages which allows for the definition of new object types 
(e.g. vectors and matrices) and the action of the standard operators such as "+", or 
"*" upon them. Once the basic operators have been implemented, the task of developing 
and debugging is simplified to checking the formulae which have been entered into the soft- 
ware. This allows the researcher to modify or extend the software and explore entirely new 
physical ideas rapidly. Finally, a very important practical benefit of using matrix operations 
wherever possible is that the theory of attaining peak performance on modern computers is 
well developed for matrix-matrix multiplication. 

The high level of modularity which naturally emerges within the DFT++ formalism 
compartmentalizes and isolates from one another the primary areas of research in electronic 
structure calculation: (i) derivation of new physical approaches, (ii) development of effective 
numerical techniques for reaching self-consistency, and (iii) optimization and parallelization 
of the underlying computational kernels. This compartmentalization brings the significant 
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Figure 1: Overview of DFT++ formalism 



advantage that researchers with specialized skills can explore effectively the areas which 
pertain to them, without concerning themselves with the areas with which they are less 
familiar. 

A few anecdotes from our own experience serve to illustrate the efficacy of this ap- 
proach. The extension of our production software to include electron spin through the local 
spin-density approximation (LSDA) required a student with no prior familiarity with our 
software only one-half week to gain that familiarity, three days to redefine the software ob- 
jects to include spin, and less than one day to implement and debug the new physics. The 
time it took another student to develop, implement, explore, and fine-tune the new numerical 



technique of Section |6.2| was less than a week. Finally, our experience with parallelization 
and optimization has been similarly successful. To parallelize our software for use with an 
SMP (using threads) required a student starting with no prior knowledge of parallelization 
two weeks to develop a code which sustains an average per processor FLOP rate of 80% 
of the processor clock speed. (See Section [7.4. 1| for details.) Finally, for massively parallel 
applications, the development of an efficient DMP code (based on MPI) , a task which often 
requires a year or more, required two students working together approximately two months 
to complete. 
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2 Overview 



Figure [I] both illustrates the interconnections among the primary areas of active research 
in modern electronic structure calculations and serves as a road-map for the content of this 
article. The figure emphasizes how the DFT++ formalism forms an effective central bridge 
connecting these areas. 

Reduction to practice of new physical approaches generally requires expressions for an 
energy functional and the derivatives of that functional, as indicated in the upper-left portion 
of the figure. Our discussion begins in Section |3] with an exposition of the mathematical 
framework which we employ throughout this work, a Lagrangian formulation of generalized 
density functional theories. In Section |] we introduce our matrix-based formalism using 
density-functional theory (DFT) within the local-density approximation (LDA) as a case 
study, deriving the requisite expressions for the energy functional and its gradient. 

In Section |5], we go on to consider several examples of other functionals for physical cal- 
culations, including the local spin-density approximation (LSDA), self-interaction correction 
(SIC), density-functional variational perturbation theory, and band-structure calculations. 
We derive the requisite expressions for the corresponding functionals and their derivatives 
in the space of a few pages and thereby show the power and compactness of our formulation 
for the treatment of a wide range of single-particle quantum mechanical problems. 

As mentioned in the introduction, our matrix-based formalism allows the relevant 
formulae to be literally typed into the computer. Because these formulae are self-contained, 
we can make, as illustrated in the upper-right portion of the figure, a clear distinction between 
the expression of the physics itself and the algorithms which search for the stationary point 
of the energy functional to achieve self-consistency. For concreteness, in Section ^| we provide 
full specification for both a preconditioned conjugate- gradient minimization algorithm and 
a new algorithm for accelerating convergence when working with metallic systems. 

Due to our matrix-based formulation, the expressions for the objective function and 
its derivatives are built from linear-algebraic operations involving matrices. As the lower 
portion of Figure |T] illustrates, the DFT++ formalism clearly isolates the software which 
contains the actual computational kernels. These kernels therefore may be optimized and 
parallelized independently from all other considerations. 

Section [F] describes these computational considerations in detail. In Section |7[l| we 
discuss the use of object-oriented languages for linking the underlying computational kernels 
with higher level physical expressions. Section |7]2| discusses the scaling with physical system 
size of the burden for the most time consuming computational kernels. There are in fact 
two distinct types of computational kernels, both of which appear in the lower portion of 
the figure. 

The first type are kernels which implement those few operators in our formalism that 
depend on the choice of basis set (L, O, X, J, X^, J\ defined in Section fO|). These kernels 
represent the only entrance of basis-set details into the overall framework. (Appendix [A| 
provides the requisite details for plane- wave calculations.) This allows for coding of new 
physics and algorithms without reference to the basis and for a single higher-level code 
to be used with "plug-ins" for a variety of different basis sets. The application of the 
basis-dependent operators can be optimized as discussed in Section [0| by calling standard 
packages such as FFTW ||. Parallelization for the basis-dependent operators is trivial 
because they act in parallel on all of the electronic wave functions at once. Section 7.4 
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discusses such parallelization for SMP and DMP architectures. 

Finally, the second class of kernels are basic linear-algebraic operations (e.g. matrix 
multiplication '*', addition '+', subtraction and Hermitian-conjugated multiplication 
'"') which do not in any way depend on the basis set used for the calculation. As such, 
the work of optimization and parallelization for these kernels need only be performed once. 
Section [7^ presents the two strategies we employ for these optimizations: blocking of matrix 
multiplication and calling optimized linear-algebra packages such as BLAS3. Parallelization 
of these operations is not trivial because data-sharing or communication is required between 
processors. We detail high performance strategies for dealing with this issue in Sections [7^4 
for both SMP and DMP architectures. 



3 Lagrangian formalism 

The traditional equations of density-functional theory are the Kohn-Sham equations [@] for a 
set of effective single-particle electronic states {V'i (?")}■ Below, when we refer to "electrons", 
we are in fact always referring to these effective electronic degrees of freedom. The electro- 
static or Hartree field (f)(r) caused by the electrons is traditionally found from solving the 
Poisson equation with the electron density derived from these wave functions as the source 
term. The ground-state energy of the system is then found by minimizing the traditional 
energy functional, which ensures the self-consistent solution of the Kohn-Sham equations. A 
great advantage of this variational principle is that first-order errors in the wave functions 
lead to only second order errors in the energy. However, although not frequently emphasized, 
errors in solving the Poisson equation due to the incompleteness of the basis set used in a 
calculation may produce a non- variational (i.e. first-order) error in the energy. 

We now consider a new variational principle which ultimately leads to identical results 
for complete basis sets, but which places {ipi} and on an equal footing and has several 
advantages in practice. The central quantity in this principle is the Lagrangian Clda intro- 
duced in [|J, which within the local-density approximation (LDA), is 

/WM(r)},0(r)) = -lj2fijd 3 r^(r)y 2 Mr) 

+ J d 3 r V ion {r)n{r) + J d 3 r e xc {n{r)) n(r) 

-Jd 3 r<P(r)(n(r)-n )-±- J d 3 r || V0(r) || 2 , (1) 

where the electron density n(r) is defined in terms of the electronic states and the Fermi- 
Dirac fillings fi as 

n(r) = S;/ < ||^(r)|| 2 . (2) 

i 

Here and throughout this article we work in atomic units and therefore have set h = m e = 
e = 1, where m e is the electron mass and e is the charge of the proton. Finally, the Kohn- 
Sham electronic states {ipi} must satisfy the orthonormality constraints 

Jd 3 r^(r)^(r) = 5 tl . (3) 
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Above, e xc (n) is the exchange-correlation energy per electron of a uniform electron 
gas with electron density n, and Vi on (r) is the potential each electron feels due to the ions. 
The constant n^ is used in calculations in periodic systems as a uniform positive background 
that neutralizes the electronic charge density The effect of this background on the total 
energy is properly compensated when the Ewald summation is used to compute the interionic 
interactions. 

The following equations, subject to the constraints of Eq. specify the stationary 
point of C L da, 



1 5C 



LDA 



8£lda 



6(f>(r) 








-\v 2 + V ion (r) - 0(r) + V xc (r) 
-(n(r) - n ) + T '-V 2 0( r ) . 

47T 



if>i(r) - eiipi{r) 



(4) 
(5) 



These are seen to be the standard Kohn-Sham eigenvalue equations for ipi{ r ) and the Poisson 
equation for the Hartree potential 0(r) where the negative sign in the second equation 
properly accounts for the negative charge of the electrons. 

The behavior of Clda is in fact quite similar to that of the traditional LDA energy 
functional, 



E 



LDA 



\ £/< / d3r ^(r)Vfyi(r) + J d 3 r V lon (r)n(r) 



+ / d r e xc {n{r)) n{r) + - d A r d 6 r 



3 n(r)n(r') 



(6) 



First, as shown in ||, evaluation of Clda ({V ; j( r )}) <K r )) > where is the solution of the 
Poisson equation, recovers the value of the traditional energy functional. Moreover, the 
derivatives of Clda and Eld a are also equal at 0. This result follows by considering a 
variation of the equality E L da ({?M r )}) = £>lda ({"0«( r )}) 0( r )) j which expands into 



5 E L d a , , , ^ 5E L da x , * / v 



$£-LDA x , / \ . &C-LDA r ,*/ \ 



+ / d 3 r 



5£ 



LDA 



5<f>(r) 



6if>(r) 



Because Poisson's equation (Eq. is the condition that the functional derivative of Clda 
with respect to <fi vanishes, the last term on the right-hand side is zero when = 0. Therefore, 
the functional derivatives of E L da and Clda with respect to the electronic states {ipi} 
are equal when evaluated at 0(r). Finally, the critical points of Clda are in one-to-one 
correspondence with the minima of Eld a- The reason is that (i) for fixed {ipi}, there is a 
unique (up to a choice of arbitrary constant) which solves the Poisson equation because 
C lda as a function of is a negative-definite quadratic form, and (ii) as we have just seen, 
at such points the derivatives with respect to {ipi} of Eld a and Clda are identical. 

One advantage of placing and {ipi} on an equal footing is that now errors in the 
Lagrangian are second-order in the errors of both the wave functions and the Hartree field. 
Additionally, as a practical matter, one has greater flexibility in locating the stationary 
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point. Rather than solving for the optimal <fi at each value of {ipi}, as is done in traditional 
DFT methods, one has the option of exploring in both {ipi} and simultaneously. However, 
some care in doing this is needed, because the stationary points of Clda are not extrema 
but saddle points. (Note that the first term of Eq. ([[]), the kinetic energy, is unbounded 
above, whereas the last term, the Hartree self-energy, is unbounded below.) This saddle has 
a particularly simple structure, and a method to exploit this is outlined in ||. 

Finally, by allowing cf) to be a free variable, we have rendered local all interactions 
among the fields. One great formal advantage is that the subtle mathematical issues in 
periodic systems arising from the long-range nature of the Coulomb interaction no longer 
require special treatment. For example, the choice of the neutralizing background no in 
periodic systems is straightforward and is treated in detail in Section |4.3.5| . Because of this 
and the aforementioned advantages, we will work in the Lagrangian framework throughout 
this article. 



4 Basis-set independent matrix formulation 

Our basis- set independent matrix formalism allows us to express the structure of any single- 
particle quantum theory in a compact and explicit way. In this section, we apply it to 
the Lagrangian of Eq. ([!]) which contains energetic terms and non-linear couplings that are 
common to all such theories. 

To make progress, we first must deal with the fact that the Lagrangian is a function of 
continuous fields. When we perform a computation, we are forced to represent these fields in 
terms of expansions within a finite basis set. Denoting our basis functions as {b a (r)}, where 
Greek letters index basis functions, we expand the wave functions and Hartree potential in 
terms of expansion coefficients C a i and (f) a through 

A(r) = J2 b a{r)C ai , 0(r) = J> a (r)0 a . (7) 

a a 

Typical and popular choices of basis functions are plane waves (i.e. Fourier modes) [[7], 
finite-element functions Q, multiresolution analyses ||, or Gaussian orbitals |J. Once a 
basis set has been chosen, Clda becomes an explicit function of the finite set of variables 
C a i and 4> a . 

In addition to the basis set itself, we require a grid of points p in real space covering 
the simulation cell. This grid is necessary for a number of operations, such as for computing 
the values of the wave functions or the electron density in real-space and for computing the 
exchange-correlation energy density e xc (n(r)) of Eq. ([!]), which is a non-algebraic function of 
the electron density n(r) and can only be computed point by point on the real-space grid. 

Our aim is to find a compact, matrix-based notation that works in the space of 
expansion coefficients C a i and (f) a and is thus applicable to any calculation within any basis 
set. In the course of doing so, we will be able to clearly identify which parts of our formalism 
require information about the particular basis that is chosen and which parts are completely 
general and independent of this choice. In addition, when we have arrived at a matrix- 
based notation, it will be clear that only a few fundamental types of computational kernels 
are needed to perform the calculation, so that parallelization and optimization need only 
address themselves to these few kernels. This provides a great boon for portability, ease of 
programming, and extensibility to new physical scenarios. 
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In the discussion below, we describe our formalism only for the case of local ionic 
potentials. The use of non-local potentials (e.g. for the important case of pseudopotential 
calculations) results in only minor changes that are addressed in Appendix |B[ Furthermore, 
for periodic systems, we will be working with only a single fc-point at k — 0, as is evident 
from the choice of expansion in Eq. (|7]). We work at k = in order to keep the mathematical 
expressions as transparent as possible. The minor extensions required to accommodate non- 
zero and multiple fc-points are straightforward and are dealt with in Appendix |C|. 

4.1 Basis-dependent operators 

In this section we describe all the operators in our formalism that depend on the basis set 
chosen for the calculation. We will see that there are a small number of such operations, 
and that we can easily separate their role from the rest of the formalism. 

The first two operators involve matrix elements of the identity and the Laplacian 
between pairs of basis functions. Specifically, we define 

Oa,p = fd 3 rb* a (r)b p (r), (8) 
L a ,p = Jd 3 rb* a (r)V 2 b p (r). (9) 

We call these operators the overlap and Laplacian respectively. Note that for orthonormal 
bases, we have O = I where I is the identity matrix. 

The integrals of the basis functions are the components of the column vector s, 

s a = J d 3 r b* a (r) . (10) 
For periodic systems, we use the vector s to define a new operator O through 

(id 

where Q is the volume of the periodic supercell. For calculations in systems without bound- 
aries, the volume Q is infinite so that O = O. The chief use of O is for solving the Poisson 
equation in periodic systems where divergences due to the long-range Coulomb interaction 
must be avoided. The automatic avoidance of such divergences and the proper choice of no 
in Eq. (P both follow directly from the nature of the Lagrangian as will be discussed in 
Section [4.3.5 . 

The next four operators involve the values of the basis functions on the points p of 
the real-space grid introduced above. The forward transform operator I allows for changing 
representation from the space of expansion coefficients to the space of function values on the 
real-space grid. Specifically, given a basis function a and a grid point p, we define 

1 va = b a {p) . (12) 

Thus the ath column of X consists of the values of the ath basis function on the points of 
the real-space grid. 
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Next, it is at times necessary to find the expansion coefficients for a function given its 
values on the real-space grid. We denote this linear inverse transform by J . In implementa- 
tions where the number of grid points p is equal to the number of basis functions, the natural 
choice is to take J = X -1 (e.g., plane- wave basis sets). However, this is not necessary: in 
some applications, one may choose to use a very dense real-space grid which has more points 
than the number of basis functions. Hence, we keep the formal distinction between J and 
X -1 . We will also require two conjugate transforms, which are the Hermitian conjugates X^ 
and Jt. 

The final mathematical object that depends on the basis set involves the ionic poten- 
tial Vi on {r). We define a column vector Vi on whose components are the integrals 



(V lon ) a = J d'r b* a (r)V lon (r) , (13) 
which encodes overlaps of the ionic potential with the basis functions. We will use Vi on when 



evaluating the electron-ion interaction energy in Section (4.3.2 . 



4.2 Identities satisfied by the basis-dependent operators 

Although the operators O, 6, L, X, J, X\ and depend on the choice of basis, they satisfy 
various identities which will prove important below. In addition to their formal properties, 
these identities allow for verification of the implementation of these operators. 

The most important identity involves the constant function. To represent the constant 
function on the grid, we define the column vector 1 as having the value of unity on each grid 
point p: l p — 1. Many basis sets can represent this function exactly (e.g. plane waves or 
finite-element sets). For such bases, for all points r in the simulation cell, we must have the 
identity 

£(.71)«&«(r) = l. (14) 

a 

Evaluating this identity on the real-space grid yields 

TJ1 = 1 . (15) 



For basis sets that can not represent the constant function exactly, the identity of Eq. ( |T5D 
and the ones below should hold approximately in the regions described by the basis. 

According to Eq. (0), the vector J\ specifies the coefficients of the expansion of the 
constant function. Using the integrals s of Eq. (fTOj) , we can see that 




d r b* Q (r) 

= Jd 3 rb* a (r)(^2(Jl)pb 
= (OJl) a . 

Thus we have that s = OJ1. We can also derive the normalization condition 

s ] Jl = J d?r J2K(r)(Jl) a = Jd 3 rl = n, (16) 
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where f2 is the volume in which the calculation is performed. 

When solving Poisson's equation for the electrostatic potential (Eq. @), we must 
take special care regarding the null space of the Laplacian operator L, which is the space of 
constant functions. Integrating the identity V 2 1 = against the complex conjugate of each 
basis function yields 

LJ1 = 0. (17) 

We use this identity when dealing with the Poisson equation in periodic systems to avoid 
divergences due to the long-range nature of the Coulomb interaction. 



4.3 Basis-independent expression for the Lagrangian 

We now use the above operators to express the Lagrangian of Eq. (|l|) in a matrix-based, 
basis-independent manner. We begin by introducing two operators, "diag" and "Diag" . The 
operator diag converts a square matrix M into a column vector containing the diagonal 
elements of the matrix. The operator Diag converts a vector v into a diagonal matrix with 
the components of v on its diagonal. In terms of components, we have that 

(diag M) a = M aa , (Diag v) af3 = v a 5 al3 , (18) 

where 5 is the Kronecker delta. Thus, diag Diag v = v for any vector v whereas Diag diag M 
= M if and only if M is a diagonal matrix. Two useful identities involving these operators 
are 

(diag M) ] v = Tr(M t Diag v) , v j (diag M) = Tr( (Diag v ) f M) , (19) 

where ' indicates Hermitian or complex-conjugated transposition. 

Next, if we regard the expansion coefficients C a i as a matrix whose ith column contains 
the expansion coefficients of the ith wave function (Eq. (|?p), and we also define the diagonal 
matrix of Fermi fillings i*y = fiSij, it is easy to see that 

p = CFC ] (20) 

is the representation of the single-particle density matrix in the space of basis functions. 

Before considering the Lagrangian itself, we will also need expressions for the electron 
density n(r) which appears in most of the terms of the Lagrangian of Eq. ([!]). We define 
a vector n whose components are the values of the electron density on the points p of the 
real-space grid. Specifically, 



n v = n{p) = J2M\MP)\\ 2 = 12M\( IC ) 




i i 

= e to; fi ( ic )pi = {( ic ) F ( ic y) pp . 

i 

whence we arrive at the identity defining the vector n 

n = diag (jPJ f ) . (21) 

Given the values of the electron density on the real-space grid, we use the inverse transform 
J to find the expansion coefficients of n(r) in terms of the basis functions. This vector of 
coefficients is just Jn. 

Armed with these few tools, we now proceed to write the various energetic terms of 
the Lagrangian in the matrix language developed above. 
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4.3.1 Kinetic energy 

The kinetic energy T can be transformed into the matrix language by using the expansion 
coefficients C of Eq. (fj]) and by using the definition of the Laplacian L of Eq. ([|): 

i ' i ct,/3 

= -±Tr (f&LC) =-^Tr (LP) , (22) 

where the last two equivalent expressions are related by the cyclic property of the trace. 
Thus, we are able to write the kinetic energy explicitly as a function of the density matrix 
P of Eq. (|0|). 



4.3.2 Electron-ion interaction 

Since the electron density n(r) is real, we may write the electron-ion interaction as 

*U = [d 3 m*(r)V ton (r) = J2 [d 3 r(JnTab* a (r)V ton (r) 

= ( Jn) V io „ = Tr (jt [Diag jV io „] TP) , (23) 

where we have used the definition of V ion from Eq. (|13|) and have used Eqs. ([21]) and ( |19|) to 
rewrite this interaction in terms of P. 



4.3.3 Exchange-correlation energy 



Given the vector n of electron-density values on the grid, we can evaluate the exchange- 
correlation energy per particle at each grid point p through e xc {n(p)). We collect these 
values into a vector e xc (n). We then inverse transform this vector and the electron density 
vector, and we use the overlap operator to arrive at 



E X c — 



d 3 r n*(r) e xc (n(r)) 
(Jn) ] O (Je xc (n)) = Tr (rf [Diag J ] OJe a 



n 



IP 



(24) 



where we again have conjugated the electron density for ease of formal manipulations. The 
derivation of the final expression in terms of P uses Eq. (|2~T]). 



4.3.4 Hartree self-energy 

The self-energy of the Hartree field can be written as 

Eh-h = ~l d 3 r ||W(r)|| 2 = i [d 3 r 0*(r)V 2 0(r) = ^L<j> , (25) 

o7T J o7T J 57T 

where we have first integrated by parts and then substituted the expansion coefficients (f) of 
Eq. (0). The complex conjugation of the real- valued function 0(r) allows for the simplicity 
of the final expression. 
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4.3.5 Electron-Hartree interaction 



The interaction of the electron density n(r) and Hartree potential <p(r) can be written as 

E e -H = - fd 3 r (n(r) - n )* 0(r) = - [J(n - n l)] j 0<P . (26) 



The proper choice of Uq for periodic systems can be found by noting that the Hartree self- 
energy E H _ H of Eq. ( p5D has no dependence on the projection of onto the null space of 



L which, as we saw in Section [4.2|, lies along the vector Jl. Thus, for the Lagrangian of 
Eq. ([[]) to have a saddle-point, there can be no coupling of n(r) — no with the projection of 
4> along Jl. That is, we must have [J(n — UqI)] O ■ Jl = 0. The identities of Section [12] 
then lead to the choice n = (Jn)^ s/VL. Our final expression for E e _n is thus given by 



E, 



e-H 



■{J-nfO<j> = -Tr (X f [Diag J*0<f> 



IP 



(27) 



4.3.6 Complete Lagrangian 



Summing all the contributions above, we arrive at two equivalent expressions for the La- 
grangian Clda, 



C 



LDA 



-^Tr (FCftLC) + (Jn) j [v ion + OJe xc (n) - 0<P 
— Tr {LP) + — f L0 



+ ^L4> 

87T 



in 



+ Tr (jtDiag [j ] V ion + J^OJe xc (n) - J^6<f>] IP) 



(28) 



(29) 



The first, compact form is computationally efficient for evaluating the Lagrangian as a func- 
tion of C and (p. The second form, written as a function of the density matrix P, finds its 
best use in the formal manipulations required to find the gradient of the Lagrangian. 



4.4 Orthonormality constraints 

The orthonormality constraints of Eq. (^) are equivalent to the matrix equation 

C ] OC = I. (30) 

If we wish to compute gradients of the Lagrangian with respect to C in order to arrive at 
the Kohn-Sham equations, we must do so while always obeying these constraints. 

The analytically-continued functional approach deals with these constraints by 
introducing a set of expansion coefficients Y which are unconstrained and which can have 
any overlap U, 

U = Y ] OY. (31) 

We also allow for the possibility of subspace rotation, which is a unitary transformation 
mapping the subspace of occupied states {ipi} onto itself. Such a transformation is affected 
by a unitary matrix V, and we parameterize V as the exponential of a Hermitian matrix B 
through 



V 



JB 



where = B . 



(32) 
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The coefficients C are defined as dependent variables through the mapping 

C = YU- l/2 V\ (33) 

which ensures that Eq. (|3~0|) is automatically obeyed, as is easy to verify by direct substitution. 
The density matrix P takes the following form in terms of Y and V, 

p = CFC ] = YU- 1/2 V ] FVU~ 1/2 Y^ . (34) 

In most cases, we simply set V — I. In fact, for the case of constant fillings, F = fl, the 
unitary matrix V drops out of P completely. The subspace rotations find their primary use 
in the study of metallic or high-temperature systems where the Fermi-Dirac fillings are not 
constant, and the rotations allow for greatly improved convergence rates when searching for 
the saddle point of the Lagrangian. This point is explained in more detail in Section |6|. 



4.5 Derivatives of the Lagrangian 

Since the most effective modern methods that search for stationary points require knowledge 
of the derivative of the objective function, we will now find the derivative of the Lagrangian 
of Eq. (p5[) or (p5|) with respect to the variables Y and (and B if subspace rotations are 
used). Differentiation with respect to Y is far more complex due to the orthonormality 
constraints, and we begin with this immediately. 



4.5.1 Derivative with respect to the electronic states 

Computing the derivative of the Lagrangian with respect to Y is intricate, and we break the 
problem into smaller pieces by first finding the derivative with respect to the density matrix 
P. Once the derivative with respect to P is found, we can use the relation between P and 
Y (Eq. (|3|)) to find the derivative with respect to Y . 

We begin by noting that except for the exchange-correlation energy, the entire ex- 
pression of Eq. ( |29| ) is linear in the density matrix P. The exchange-correlation energy is a 
function of the electron density n, which, through Eq. (pi]), is also a function of P. Thus 
if we consider a differential change dP of the density matrix, the only term in djC,LDA that 
needs to be considered carefully is 

n)j^Ojd[e xc {n)) = rJj^OJ [Diag e' xc {n)) dn 

= Tr {l+Diag ([Diag 4.(71)] J ] OJn)ldP) . 



In the above derivation, we have used Eq. ( pT|) to relate dn to dP as well as the identities 
of Eq. (]ToD . The vector 4( n ) is given by its values on the real-space grid points p via 

(4H) P = 400))- 

We can now write the differential of the Lagrangian of Eq. (p9| ) with respect to P as 

dC LDA = Tk^—LdP+X^Bi^lj^ + J^OJe^n) 

+ [Diag e' xc {n)]J^OJn - J^0<p]ldP} 
= Tr (HdP) , (35) 
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where the single-particle Kohn-Sham Hamiltonian operator H is given by 



H = -^L + rf [Diag V sp ] I , where 

V sp = J^V ion + J j OJe xc (n) + lDmge' xc (n)]J^OJn-J^6(j). (36) 

The single-particle Hamiltonian is the sum of a kinetic operator and a local single-particle 
potential V sp (a vector of numbers on the real-space grid specifying the values of the poten- 
tial). 

Eq. ( |35f) has conveniently separated out the physical description of the system, the 
Hamiltonian H, from the variation dP which we now compute. Differentiating the relation 

JJ-X/2JJX/2 = we find that 

dp- 1 / 2 ] = -u-^dp^p- 1 / 2 , 

and we use this to express the variation of the density matrix of Eq. ([34|) as 

dP = (dY)U- 1/2 V^FVU- 1/2 Y^ + YU- 1/2 V^FVU- 1/2 (dY^) 

- YU- 1 ' 2 (dp l l 2 p- l l 2 V^FV + V^FVU- l l 2 dp 1 ' 2 ]) U- l ' 2 Y\ 



We now substitute this expression for dP into Eq. ([35|). We use the definition of the operator 
Q (Eq. (^SJ) of Appendix g), its relation to dp l/2 } (Eq. @), and the identities which Q 
satisfies (Eqs. (|56|)). After some manipulations involving the cyclicity of the trace, we arrive 
at 



d£ L DA = Tr 
' dC LDA \ 

art J 



, where 



/ - OCC ] ) HCFVU- 112 + OCVQ (v ] [H, F]v) , and 

H = C^HC, (37) 

where H is the subspace Hamiltonian and contains matrix elements of the Hamiltonian H 
among the wave functions {ipi(r)}. Square brackets denote the commutator, [a, b] = ab — ba. 
Physical interpretation of the terms in Eq. (|37|) is provided in Section |4.6| . 

Finally, since Y and Y' are not independent, we can simplify the expression for the 
differential of Clda to 

9Clda 



dC-LDA = 2 Re Tr 



dY ] 



art 



where Re denotes the real part of its argument. 



4.5.2 Derivative with respect to the Hartree field 

Since the Lagrangian in Eq. (p8|) is quadratic in 0, its derivative with respect to <fi may be 
readily calculated. However, to arrive at a symmetric expression for the derivative in terms 
of 4> and (f>\ we note that the linear dependence on 0, given by z = [J'nyOcf), is a real 
number because both n(r) and <p(r) are real in Eq. (p6[) . For convenience, we rewrite this 
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as (z + z*)/2, which is an equivalent expression symmetric in and (ft. By using this, we 
compute the variation of Clda and arrive at 



dCiDA 

' dC LDA \ 

, w ) 



f f dC LDA \ | 



'dC 



LDA 



d<ft 



where 



I 07T 



(38) 



Again, since and (ft are not independent, we can express the variation as 

dC LDA = 2 Re 



t / OClda 



4.5.3 Derivative with respect to subspace rotations 

We have parameterized the unitary matrix V of Eq. (|33"D by the Hermitian matrix B of 
Eq. (|3~2"D as V = e tB . We will now find the derivative of Clda with respect to B. First, we 



consider the variation of Clda with respect to those of V and V* by using Eq. fl3~5]) as our 
starting point. Using the definition of P in Eq. (|34|), we have that 



dC 



LDA 



Tr {HdP} 

Tr [H'[dV ] FV + U f iW]} , 



where H' = ET^/ayl HYU' 1 ' 2 . Differentiating the identity VW 
— V^dVV^ which allows us to write 



/ leads to dV^ 



dC LDA = Tr{[H,F]dVV^} 



where again H = C^HC is the subspace Hamiltonian. 

We place the eigenvalues of B on the diagonal of a diagonal matrix (3 and place the 
eigenvectors of B in the columns of a unitary matrix Z. Thus B = Zf3Z^ and Z^Z = ZZ^ = 
I. We now use the result of Eq. ( [53T) of Appendix || applied to the case V = f(B) = e tB to 
arrive at the following result relating dV to dB: 



{ZUVZ) nm = {ZUBZ) n 



Pm 0n 

Using this and the fact that = Ze~ l "Z\ we have that 
dC LDA = Tr {[H,F]Z{Z*dVZ)ZW} 



if m — n 
if m n 



Tr {Z^[H , F}Z(Z^dVZ)e~ ip ) 
^2(Z^[H,F}Z) nm (Z^dBZ) mn 



Pm Pn 



if m = n 
if m 7^ n 
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We define the operator R(A) acting on a general matrix A via 



{Z^R(A)Z) nm = {Z^AZ), 



i if m = n 



if m 7^ n 



This allows us to write the variation of Cloa as 

dC LDA = Tr [Z^R ([H, F]) ZZ^dBZ} = Tr {R ([H, F]) dB} 
so that the derivative of Clda with respect to B is 

9Clda 



dB 



R([H,F]). (39) 



4.6 Kohn-Sham and Poisson equations 

The Kohn-Sham and Poisson equations (Eqs. @ and (P)) are obtained by setting the deriva- 
tive of the Lagrangian with respect to Y and <p to zero. This results in the two equations 



'dC 



LDA 



) = = (/ - OCC ] ) HCFVU- 1 ' 2 + OCVQ(V^[H, F]V) , (40) 



'dC LDA \ I . , I 



= --OJn + — L0. (41) 



V 90t y 2 8tt 

Eq. PD[) states the stationarity of the Lagrangian with respect to variations of the wave- 
function coefficients Y, and we examine it first. 

We define the projection operator p = OCC^ which satisfies p 2 = p and which projects 
onto the subspace of occupied states {ipA- used in the calculation. Its complement p = I — p 
projects onto the orthogonal subspace spanned by the unoccupied states. By multiplying 



Eq. ( f40"D on the left by p and assuming that none of the Fermi fillings are zero, we find that 

pHC = . 

This reproduces the well known condition that at the stationary point, the Hamiltonian must 
map the occupied subspace onto itself. 

Conversely, we can project Eq. ( |40"D onto the occupied subspace by multiplying on 
the left by C'. This, combined with the fact that Q is an invertible linear operator, leads to 
the condition 

[H,F] = 0. 

Given that F is a diagonal matrix, for arbitrary fillings, the subspace Hamiltonian H also 
must be diagonal: the states {ipi} must be eigenstates of H with eigenvalues e^, as we have 
explicitly written in Eq. (|4]). However, if a pair of states ipi and ipj have degenerate fillings, 
fi — fji then Hij need not be zero. Converting such degenerate cases into the conventional 
diagonal representation requires a further unitary rotation, which, however, is not required 
for stationarity. 

The second condition for stationarity, Eq. (|4T|), rearranges into 

L<p = AnOJn . (42) 
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We have arrived at the Poisson equation for the Hartree potential generated by the electron 
density n (the negative electronic charge is reflected by the positive coefficient of the right- 
hand side). Since we have explicitly projected out the null-space of L from the right-hand 
side, we may invert L and find the solution to the Poisson equation 



4> = 4:7iL~ 1 OJn. (43) 

Finally, if we substitute the result of Eq. (f43|) for into our Lagrangian, we find the 
LDA energy functional (cf. Eq. (§)): 



V lon + OJe xc (n) 



-O UnL^OJn 
2 



E LDA = -^Tr (F&LC) + {Jn) ] 
4.7 Expressions for Lagrangian and derivatives: summary 



(44) 



We now collect the expressions for the LDA Lagrangian and its derivatives in one place. As 
we will see in Section [7| formulae in the DFT++ notation translate directly into lines of 
computer code, so that we will also be specifying the computational implementation of the 
Lagrangian. In addition, given the Lagrangian and its derivatives, we can apply any suitable 
algorithm to find the stationary point (Section 
The key expressions are 



C LDA (Y,<P,B) = 


-~Tr (F&LC) + {Jn) ] [V ion + OJe xc (n) - 00 




9Clda 


(/ - OCC ] ) HCFVU- 1 ' 2 + OCVQ (v\H, F]v] 




9£lda 






H = 


--L + X f [Diag V sp ] X , H = C^HC . 





As discussed in Section || the value of Clda and its Y and B derivatives are equal to 



the value and respective derivatives of the energy E LDA of Eq. (|44]) when we evaluate the 
Lagrangian-based quantities at the solution of the Poisson equation. Therefore, the above 
expressions can also be used to find the derivatives of E^da, a fact that we will exploit in 
Section || 



5 DFTH — |- specification for various ab initio techniques 

In the previous section, we presented a detailed derivation of the expression for the LDA 
Lagrangian and its derivatives in the DFT++ formalism. We believe that the community 
of physicists and chemists using this and other general-functional methods should use this 
formalism for the communication of new energy functionals and comparisons among them. 

From a physicist's or chemist's viewpoint, which we adopt in this section, linear 
algebra and matrices are the settings in which quantum mechanical computations must be 
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performed. Therefore, they are the fundamental objects in the new formalism. This is 
in contrast with the Dirac notation, which while an excellent conceptual tool for studying 
quantum problems, can never be used to carry out an actual calculation: matrix elements 
of bras and kets must first be found before an actual computation can proceed. 

Once an expression for an energy functional is found, its derivative is found by 
straightforwardly differentiating it with respect to the matrices of independent variables. 
Armed with expressions for the functional and its derivative, the methods discussed in Sec- 
tion |6] can then be applied to achieve self-consistency. 

In this spirit, we now present a few examples of energy functionals. Some are exten- 
sions of the LDA, while others are similar to the LDA only in that they involve the study of 
single-particle systems. In all cases, our aim will be to show how quickly and easily we can 
find the requisite expressions for the appropriate functional and its derivative. 



5.1 Local spin-density approximation (LSDA) 

The most straightforward generalization of the LDA approximation is to allow for spin-up and 
spin-down electrons to have different wave functions but to still treat exchange-correlation 
energies in a local approximation. Specifically, the exchange-correlation energy per particle 
at position r is now a function of both the spin- up and spin-down electron densities, n-f(r) 
and ni(r), and the total LSDA exchange-correlation energy is given by 



E xc = J d r n(r) e xc (n^(r), n^r)) , 
where n(r) = n^(r) + n^(r) is the total electron density. The LSDA has been found to show 



since 



substantial improvements over the LDA for atomic and molecular properties [T(J, [IT 
the spin of the electrons is explicitly dealt with. 

The changes required in the expressions of the Lagrangian and its derivatives in order 
to incorporate the LSDA are straightforward and easy to implement. We label spin states by 
cr, which can take the value j or j. We have spin-dependent expansion coefficient matrices 
G a for the wave functions (cf. Eq. (|7|)). Each spin channel has its own fillings F a and density 
matrix P a , 

P a = C a F a Cl. 

The electron densities n a and the total electron density n are given by (cf. Eq. ([21])) 

n a = diag(XP CT X^) and n = . 

The LSDA Lagrangian is given by 
C LS DA(C h C ; , 0) = -\ X; Tr (F a ClLC a ) + {Jn) ] [v ion + Oje xc (n h 71J - 6<j>] + -^£0. 

The orthonormal expansion coefficients C a are found from unconstrained variables Y a via 

C a = Y a U- l/ \ where U a = YjOY a , 
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where, for simplicity, we have set subspace rotations to unity, V a = I. Finding the derivatives 
of the Lagrangian with respect to the coefficients Y a follows the analysis of Section [4.5. 1| . 
Each spin channel has a single-particle Hamiltonian H a given by 



H a = + jt[Diag V a ]l , where 

V a = J^V ion + J f OJe xc (n hni ) + Diag 



<9e xc (n T 



J ] OJn - J^O<j>. 



drier 

The derivative of the Lagrangian with respect to Y a (cf. Eq. flT7|)) is given by 

I 9Clsda 
{ dY} ) 

In summary, we have the following expressions for the LSDA Lagrangian and deriva- 
tives 



-) = (l-OC a Cl)H a C a F a U;y 2 + OC a Q{[H a ,F a }), where H a = ClH a C a . 



C LSDA {Y h Y il( j>) = -i^Tr(F CT CtLC (T ) + ^0 t ^ 



dC 



LSDA 



+ {Jn) ] [V ion + OJe xc {n h - Ocf) 
I - OC a Cl) H a C a F a U- 1/2 + OC a Q([H a , F a \) 



dYj 

r, ti = —;Vjn + —L4>. 

C/0T 2 8ll 



5.2 Self-interaction correction 

The LDA and LSDA exchange-correlation functionals suffer from self-interaction errors: the 
functionals do not correctly subtract away the interaction of an electron with its own Hartree 
field when the electron density is not uniform. Perdew and Zunger [|12|] proposed a scheme 
to correct for these errors (the SIC-LDA which we simply refer to as SIC below). 

The idea is to subtract the individual electrostatic and exchange- correlation contri- 
butions due to the density n.j(r) = H^M?")!! 2 of each quantum state ipi(r) from the LDA 
functional. This procedure has the virtue of yielding the correct result for a one-electron 
system as well as correcting for the Hartree self-interaction exactly. In terms of our formal- 
ism, we define the density matrix Pj and electron density rii for the state i and relate them 
to the total density matrix P and total electron density n through 

Pi = Ceifieltf , P = £P ; n, = diag(XP i jt) , n = ^> , 

i i 

where is the column vector with unity in the ith entry and zero elsewhere. In addition to 
the total Hartree field 0, we also introduce Hartree fields (pi for each state i, and the SIC 
Lagrangian takes the form 

Csjc{C,{^}) = —TrfFC^L^ + iJn^lv^ + OJe^-O^+^LcP 

2 v 7 L J 57T 

- E(^) f \OJe xc (n t ) - Ofa] - ^ E tiWi ■ 
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Setting the variation with respect to fa and <fi to zero (cf. Section |4.6| ) results in the 
Poisson equations 

Lcj) = AnOJn , Lfa = AnOJn t . 
Substituting these solutions into the SIC Lagrangian yields the familiar SIC energy 



E SW {C) = --Ti(FC'LC) + (J n y 



11; 



V ion + OJe xc (n) 
OJe xc ( ni ) - X -O^L- x OJn r 



1 « 



0(AnL- L OJn 



The derivatives of the SIC Lagrangian with respect to the density matrices Pi generate 
state-dependent Hamiltonians Hi and state-dependent potentials Vj given by 



Hi 

Vi 



-\l + T ] [Diag(K P -^)]J, 

J ] OJe xc {m) + [Diag e' xc {m)} J ] OJn t - J^Ofa , 



where the state-independent potential V sp is that of Eq. (|3q). The derivation of the expression 
for the derivative of the Lagrangian with respect to Y follows precisely the same steps as in 



Section [4.5.1| , and the final form is 
/ d£sic\ 



/ - OC(f) (j2 HiPeifieU fT 1/2 + OCQ [H, e^e}] J 



Hi 



C^HiC . 



An examination of this form shows that to compute the derivative, each Hamiltonian Hi 
need only be applied to the ith. column of C (as the product Ce« occurs in all places), so 
that computation of the derivative is only slightly more demanding than the corresponding 
LDA derivative. 

The above results for the derivative are a generalization of those in [jn|. Those 
authors, however, work in the traditional real-space representation (where necessarily all the 
sums over indices appear) and, at each step of the minimization, orthonormalize their wave 
functions, so that their expressions are a special case of ours when U — I. 

The summary of the SIC Lagrangian and derivatives follows. 



C SIC {Y,fa{fa}) 



-^Tr (FC ] LC) + (Jn) ] 



Vion + OJe xc {n) - 0<P 



-E(^) f \0 Je xc (n t ) - 6 fa 



+ TrfaLcp 



57T 



dC 



sic 



art 

dCsic 



/ - OC&) {^2 HiCeJie^j U^ 2 + OCQ ^£ [h u e^e} 

--OJn + —LS , = -OJni - — Lfa . 

2 8tt ' d<p\ 2 1 8tt Vl 
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5.3 Band-structure and fixed Hamiltonian calculations 



Very often, we aim to find a set of quantum states {ipi} that are the lowest energy eigenstates 
of a fixed Hamiltonian. One case where this occurs is in the calculation of band structures 
for solids within DFT, where one has already found the stationary point of the Lagrangian 
and the optimal electron density n(r). One then aims to explore the band structure for 
various values of /c-vectors. (See Appendix |C| for a full discussion of fc-points.) This requires 
finding the lowest energy eigenstates of the Hamiltonian. The problem is the same as a 
tight-binding calculation in the sense that the Hamiltonian is fixed and the electronic energy 
of the system is sought after, i.e. the minimum of the expectation value of the Hamiltonian 
among an orthonormal set of states. In both cases, the approach described below is most 
useful when the number of basis functions is much larger than the number of states {ipi} so 
that direct diagonalization of the Hamiltonian is computationally prohibitive. 

In such cases, we have a Hamiltonian H, and we expand our wave functions as shown 
in Eq. (0). We must minimize the energy E 

E = Tr(C^HC) . 

We introduce unconstrained variables Y in the same way as before (Eq. ([H]) and onwards). 
The variation of the energy is given by 

dE = Tr{H d\YU~ l Y^)) = 2ReTr 

where the derivative of E is 

1 (/ - OCC^HCTJ- 1 ' 2 




\dY\ 

When we are at the minimum of E, we have an orthonormal set C that spans the subspace of 
the lowest-energy eigenstates of H. The minimum value of E is the electronic energy for the 
case of a tight-binding Hamiltonian. If the energy eigenvalues and eigenvectors are desired, 
we diagonalize the subspace Hamiltonian H = 0HC to obtain the eigenvalues e. We then 
use the unitary matrix S which diagonalizes H, H = S*(Diag e)^, to find the expansion 
coefficients for the eigenstates, given by the product CS. The summary of key equations 
follows. 



E(Y) = Tv(C^HC) , 

H = (I-OCC^)HCU- 1 / 2 . 
oY* 



5.4 Unoccupied states 

A slightly more complex variant of the previous problem arises when we have converged 
a calculation, found the orthonormal states C spanning the occupied subspace, and are 
interested in finding the eigenvalues and eigenstates for the low-lying unoccupied states. For 
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example, let us say that we have converged a calculation in an insulator or semi-conductor, 
where the occupied space specifies the valence band. We wish to find the low-lying conduction 
states in order to study the band structure and band-gap of the material. 

Thus, we start with a fixed Hamiltonian H and a fixed set of occupied states C. 
We aim to find a set of orthonormal unoccupied states D that are orthogonal to C and 
which also minimize the expectation of the Hamiltonian. Specifically, we wish to minimize 
E = Ty(D^ HD) under the orthogonality constraint D^OC = 0. 

We introduce a set of unconstrained states Z. We project out the part of Z lying in 
the occupied subspace by using the projection operator = I — CC^O of Section fi~6| , 

D = p ] ZX- 1/2 where X = (p f Z) ] 0(p ] Z). 

Then, following the results of the previous section, the differential of E is given by 

dE' 



dE = Ti^Hd^ZX^pZ^}) = 2ReTr 
where the derivative of E with respect to Z is given by 



dZ ] 



8Z1 



(Ht) = (/ ~ OCC ^ 1 - ODD^HDX- 1 ' 2 . 

As expected, the derivative has two projection operators: p — I — OCC\ which projects 
out the component lying in the occupied subspace, and (I — ODD^), which projects out the 
component lying in the portion of the unoccupied subspace under consideration. Minimiza- 
tion of E leads to a set of states D that span the lowest-lying unoccupied states. At the 
minimum, the resulting unoccupied subspace Hamiltonian H = D^HD can be diagonalized 
to obtain the desired eigenvalues and eigenstates. 

The energy and its gradient are summarized by 



E(Z) = Tt(D^HD), 
f)E 

= (/ -OCC ] ){I -ODD ] )HDX- 1 ' 2 , 

D = p^ZX- 1 ' 2 , X = $Z)*0{ptZ) , p = I- OCC ] 



5.5 Variational density-functional perturbation theory 

In this final application, we consider perturbation theory within a single-particle formalism, 
which is required to compute response functions. Specifically, we consider the important case 
of linear response, which was first dealt with in |TJ]]. We imagine that we have converged the 
calculation of the zeroth-order (i.e. unperturbed) configuration and have found the zeroth- 
order wave functions Cq for our problem. We now wish to find the first-order changes of 
the wave functions, C%, due to an external perturbation to the system. Depending on the 
type of perturbation applied, the variation C\ allows for the calculation of the corresponding 
response functions. For example, the displacement of atoms along a phonon mode allows for 
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the computation of the dynamical matrix for that mode whereas perturbations due to an 
external electrostatic field allow for calculation of the dielectric tensor. 

Regardless of the physical situation, all perturbations enter as a change in the ionic 
(or external) potential Vi on which drives the electronic system. Letting A be the perturbation 
parameter, we expand any physical quantity A in powers of A and let A n be the coefficient 
of A n in the expansion. A few examples follow 

Vion — Vionfi + ^Vion,l + ^?Vion,2 H ' 

c = c + xc 1 + x 2 c 2 + --- 

n = tiq + Ani + A 2 n 2 + • • • 

As is well known from perturbation theory, the first order change V^ ori) i determines the first 
order shift of the wave functions C\. 

The work of |TJ]] shows that C\ can be obtained via the constrained minimization 



of an auxiliary quadratic functional of C\. In our matrix-based notation, for the case of 
constant fillings (taken to be unity) and the LDA approximation, this quadratic functional 
is given by 



E{C X ) = TrjcjFod-CjOCipiageo] 

+(J r ^i) t Wion,i + OJ [Diag a(n Q )} m - + E n 



The energy E nonvar contains terms that depend only on Co or the Ewald sum over atomic 
positions and need not concern us any further. The zeroth-order Hamiltonian Hq = — + 
Jt[Diag V ]I is the same as that of Eq. (^) where we have simply renamed the zeroth- 
order single-particle potential to V . The diagonal matrix Diag e holds the eigenvalues 
of the zeroth-order Hamiltonian. The vector a(n ) is found by evaluating the function 
a{n) = -py (ne xc (n)) on the zeroth-order electron density n over the real-space grid. The 
vector rii, the first order shift of the electron density, is given by 



2 Re diag (lC Q C\l^) . 



The first order change of the Hartree potential <fii is the solution of the Poisson equation 

Given the quadratic nature of E(C\), its differential with respect to C\ follows im- 
mediately and is given by 

dE = 2Re Tr [dC\ (h q Ci - OC x [Diag e Q ] + X f [Diag Vi] XC ) } , 

where the first-order single-particle potential V\ is given by 

V x = J^Vion,! + J^OJ [Diag a(n )} m + [Diag a(n )] J^OJm - J^Ofa . 

The constraint to be obeyed during the minimization is that the first-order shifts C\ 
be orthonormal to the zeroth-order wave functions Co, 

C\OC = I . 
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This constraint is easily handled in the manner of the previous section by using a projection 
operator. We introduce an unconstrained set of wave functions Y\ from which we project 
out the part laying in the space spanned by Co, 

c 1 = (/-c r cjo)y 1 . 

Based on this relation, we find the gradient of E with respect to Y\ 



dE 2 Ro I ■{ f/1', 1 I —- I } where 
dE 



fj = (I- OC Cl) [h q C x - OdfDiageo] + X f [Diag V x ] 1C } . 

Finally, we can convert the energy function into a Lagrangian by letting <pi be a free 
variable and by adding the appropriate Hartree self-energy and coupling to n x . We arrive at 
the summarized expressions 



C{Y u <j>i) = Tr{C 1 t J Ff C 1 -C 1 t OC 1 [Diage ]} + J B nonwr . 

+{Jni) ] {V ion>1 + OJ [Diag a(n )] m - Ofa} + —<t>[l4\ 



dC 



— T = (J-OC C t ){iJoCi-OC 1 [Diageo]+X t [DiagF 1 ]JCo} 
oY{ L J 



9£ 1 * n- 1 r l 



6 Minimization algorithms 

In this section, we show how the DFT++ formalism can succinctly specify the algorithm 
which finds the stationary point of the Lagrangian or energy function (derived in the pre- 
vious sections). Such an algorithm only requires the value and derivative of the objective 
function, which is the reason that we have repeatedly emphasized the importance of these 
two quantities in our analysis above. Once we choose a minimization algorithm, we need 
only "plug in" the objective function and its derivative into the appropriate slots. Further- 
more, since the DFT++ formalism is compact and at the same time explicit, once we specify 
the operations that must be performed for a given algorithm in our notation, the transition 
to coding on a computer is trivial: the formulae translate directly into computer code (as 
shown in Section |7|). 

Specifically, we aim to find the stationary point of the Lagrangian of Eq. ( p8|) with 
respect to Y and <fi and possibly B. A direct search for the stationary point is possible, and 
at the saddle point, both the Kohn-Sham and Poisson equations hold true simultaneously. 
This highly effective strategy has been followed in 0. Alternatively, other approaches to 
reach a solution of these equations through self-consistent iteration and use of Broyden-like 
schemes |15] may be considered. 
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calc-energy(F) : calculates U, U 1//2 , C, n, 0, and 
returns the LDA energy 

• Calculate the overlap U = Y^OY 

• Diagonalize U = WfiW^ and calculate U~ 1 ' 2 = W n~ l ' 2 W^ 

• Calculate the orthonormal states C = YU~ l l 2 

• Calculate the electron density n = f ■ diag |(XC)(ZC)^| 

• Solve the Poisson equation = 4irL~ 1 OJ'n 

• Return the LDA energy Elda' 



E LDA = -^Ti(C^LC) + (Jny 



V ion + OJe xc {n) - ^04> 



Figure 2: LDA energy routine (DFT++ formalism) 



However, in order to make direct contact with DFT calculations within the traditional 



minimization context J7|, and to keep our presentation as simple as possible, we choose 
instead to solve the Poisson equation (Eq. fl42|)) for the optimal at each iteration of the 
minimization algorithm. For cases where L _1 is easy to compute (e.g. the plane-wave basis 
where L is diagonal), we may compute the solution directly from Eq. (f43[). Otherwise, 
the straightforward approach of maximizing the quadratic functional G(<f>) = (l/87r)^ L<p — 
(J^nyOcj) via conjugate gradients (or some other method) achieves the same goal. For 
certain basis sets, multigrid methods or other specialized techniques are possibilities as well 



16, 1/]. Once the Poisson equation has been solved, the remaining free variable is the 



matrix of coefficients Y, and the aim of the calculation is to minimize the energy E^da of 
Eq. ( pR| ) with respect to Y. 

As shown in Section [3], the value and F-derivatives of the Lagrangian Clda and 
energy Elda are identical if we evaluate the Lagrangian-based expressions using the Hartree 
potential which is the solution of Poisson's equation. Therefore, in our algorithms below, 
we can use expressions derived for derivatives of the Lagrangian when dealing with the 
energy. 



6.1 Semiconducting and insulating systems 

Consider the case of a semiconducting system with a large unit cell. The fillings are constant, 
F = fl, and we will use a single fc-point at k = (as appropriate for a large cell). The 
simplest algorithm for minimizing the energy is the steepest descent method: we update Y 
by shifting along the negative gradient of the energy with respect to Y, scaled by a fixed 
multiplicative factor. As a first organizational step, we introduce the function calc-energy(F), 
whose code we display in Figure Given Y, this function computes the overlaps U and 
U^ 1 / 2 , the orthonormalized C, the electron density n, the solution to Poisson's equation 0, 
and returns the LDA energy. Figure |^ displays the steepest descent algorithm as it appears 
in the DFT++ language. 
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Calculate the ionic potential Vi on (Eq. (p~3|) ) 
Randomize Y and choose the stepsize A 
Iterate until E L da is sufficiently converged: 
o Calculate the energy: E L da = calc-energy(F) 
o Calculate the single-particle potential V sp : 

V sp = J ] [V lon + OJt xc {n) - 0<f>] + [Diag e' xc {n)]J^OJn 
o Calculate the gradient of Elda' 

G = d -^^ = f(I - OCC^HCU- 1 ' 2 
o Take a step down the gradient: Y — XG — > Y 

Figure 3: Steepest descent algorithm 



quadratic-line- min(Y, X, E, G): quadratic minimization of Elda 

X is the search direction for the minimization 
E is the energy Elda at Y 
G is the gradient of Elda at Y 

• Compute the directional derivative of Elda at Y: A = 2 Re Tr(X^G) 

• Compute the energy at trial position E t = calc-energy(F + XtX) 

• Compute the curvature of the energy function c = [E t — (E + AtA)]/Af 

• Compute the minimizing shift A = —A/ (2c) 

• Shift to the minimum: Y + AX — > Y 



Figure 4: Quadratic line minimizer 



We would like to emphasize a number of points regarding this code. First, the algo- 
rithm optimizes all the wave functions (i.e. the entire matrix Y) at once, leading to a very 
effective minimization and a complete avoidance of charge creep or other instabilities. 
Second, the code is independent of the basis set used: the basis-dependent operators L, 
O, etc., are coded as low- level functions that need to be written only once. The choice of 
physical system and minimization algorithm is a high-level matter that is completely de- 
coupled form such details. Third, the figure shows all the operations required for the entire 
minimization loop. That this is possible is grace to the succinct matrix formalism. 

The only part of the algorithm of Figure § that is specific to the steepest descent 
method is the last operation where Y is updated. To generalize to a preconditioned conjugate- 
gradient algorithm is quite straightforward, and we specify the necessary changes below. 
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• Calculate the ionic potential Vi on 

• Randomize Y 

. Forj=0, 1,2,... 

o Let Ej = calc-energy(lj) 

o Calculate the single-particle potential V sp 

o Calculate the gradient: Gj = (OElua/ 'dY')^ 

o Apply the preconditioner: Tj = K(Gj) 

o If j > then set oij = Re Tr(G]rj)/Re Tr(G J t_ 1 r 3 -_i); otherwise 
OLj = 0. 

o Compute the search direction Xj = Tj + ct/Xj-i 
o Call quadratic-line-min(>j, Xj, Ej, Gj) 



Figure 5: Preconditioned conjugate- gradient algorithm 



Conjugate-gradient algorithms require line minimization of the objective function 
along a specified direction, i.e. an algorithm is needed that minimizes E(X) = Elda(Y + 
AX) with respect to the real number A for a fixed search vector X. The subject of line 
minimization is rich, and an abundance of algorithms with varying degrees of complexity exist 
in the literature. (For a brief introduction see ||18|| .) However, for typical DFT calculations, 
most of the effort for the calculation is spent in the quadratic basin close to the minimum. 
Thus, a simple and efficient line-minimizer that exploits this quadratic behavior should be 
sufficient, and we have found this to be the case in our work. 

Our line-minimizer takes the current value of the energy and its derivative as well as 
the value of the energy at the shifted trial configuration Y + X t X (where X t is a trial step-size) 
to achieve the quadratic fit E(X) pa E + A A + cA 2 . Here, A is the directional derivative of 
E with respect to A, and c is the curvature of E with respect to A. The line-minimizer then 
moves to the minimum of this quadratic fit located at A = — A/ (2c). Figure f| shows the 
code for this line-minimizer. 

Using this line minimizer, we present the entire preconditioned conjugate-gradient 
algorithm in Figure |5|. Note that we have omitted some of the formulae which are identical 
to those of Figure^. A simple diagonal preconditioning, as described in J7J, is quite effective 
for plane-wave basis sets, and the operator K is the preconditioner in the algorithm of the 
figure. 

6.2 Metallic and high-temperature systems 

While the degrees of freedom in the variable Y are sufficient to describe any electronic 
system, the convergence of minimization algorithms can be hampered by ill-conditioning of 
the physical system. A standard case of such a problem is when the Fermi-Dirac fillings fi 
are not constant and some fillings are very small, a situation encountered when studying 
metals or high-temperature insulators. 

One type of ill-conditioning that arises due to states with small fillings, fa <C 1, stems 
from the fact that changes in such states do not affect the value of the energy E^da very 
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much when compared to the states with large fillings, fi ~ 1. Thus modes associated with 
the small-filling states are "soft" and we have an ill-conditioned problem. The solution to 
this problem, however, is rather straightforward and involves scaling the derivative of Clda 
with respect to the state ipi by 1/ fi. 

Much harder to deal with are soft modes due to subspace rotations which were intro- 
duced in Section [Q[ As we saw there, the unitary transformations V, which generate the 
rotations, cancel out completely from the expression for the density matrix P in the case of 
constant Fermi fillings, F = fi. Since the entire energy can be computed from P alone, the 
energy will not depend on V. Hence we have found that the unitary transformations V are 
an exact symmetry of a system with constant fillings. 

However, once we introduce variations in the fillings, the symmetry is broken. Now 
both the density matrix and the energy change when V mixes states with different fillings. 
If the difference in fillings between the mixed states is small, a case typically encountered in 
a real system, the mixing produces small changes in the energy. Again, we have soft modes, 
this time due to the breaking of the unitary subspace-rotation symmetry. 

The idea of using subspace rotations was first suggested in [|TJ| . Its use as a cure for 



the ill-conditioning described above was discussed and demonstrated convincingly in [|19 
The strategy is first to minimize the objective function over B (since B parameterizes the 
rotation V) and only then perform minimization on the wave functions Y. By ensuring that 
we are always at the minimum with respect to B, we automatically set the derivative of 
Elda with respect to subspace-rotations to zero. When this is true, changing Y can not 
(to first order) give rise to contributions due to the soft modes, and we have eliminated the 
ill-conditioning. 

In practice, we have found it unnecessary for our calculations to be at the absolute 
minimum with respect to B: being close to the minimum is sufficiently beneficial compu- 
tationally. In our algorithm, we take steps along both Y and B simultaneously but ensure 
that our step-size in B is much larger than that in Y. As the minimization proceeds, this 
choice automatically drives the system to stay close to the minimum along B at all times. 



Our simpler procedure has been found as effective as the original strategy of [|19|] and 
translates into using the following search direction X in the space (Y, B) 

( OClda dC LD A\ 

x= [-wr^-^B-) 

where 7] is a numerical scaling factor. We have found r] ~ 30 to be a good choice for efficient 
minimization while avoiding the ill-conditioning mentioned above. 

As a practical showcase of the improvement in convergence in a metallic system, 
we study the convergence rate for the conjugate- gradient minimization of the energy of 
bulk molybdenum. We study the bcc cubic unit cell containing two Mo atoms. We use a 
plane-wave basis set (details of implementation in Appendix with an electronic cutoff 
of 22.5 Hartree (45 Ryd), and a 4x4x4 cubic fc-point mesh to sample to Brillouin zone. 
The electronic temperature used is &bT=0.0037 Hartree (0.1 eV), the Fermi fillings are 
recomputed every twenty conjugate-gradient steps based on the eigenvalues of the subspace 
Hamiltonian from the previous iteration, and a value of rj = 30 is used to calculate the search 
direction. Non-local pseudopotentials of the Kleinmann-Bylander form [20] are used with 



p and d projectors. Figure presents a plot of the convergence of the energy per atom to 
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50 100 150 200 

CG iterations 



Figure 6: Effect of subspace rotation on convergence - Convergence of conjugate-gradient 
minimization with (solid) and without (dashed) the use of subspace rotations for the case of 
a metallic system. Both minimizations use the same random wave functions as their starting 
points. The horizontal axis is the number of conjugate- gradient iterations and the vertical 
axis is the energy per atom above the minimum energy in Hartree per atom in logarithmic 
units. See the text for further details. 



its minimum value (as determined by a run with many more iterations than shown in the 
figure). We compare minimization with and without the use of subspace rotation variables, 
and both minimizations are started with the same initial random wave functions. The extra 
cost required for the use of subspace rotations was very small in this case, the increase in 
the time spent per iteration being less than two percent. As we can see, the use of subspace 
rotations dramatically improves the convergence rate. 

7 Implementation, optimization, and parallelization 

In this final section we address how the DFT++ formalism can be easily and effectively 
implemented on a computer, and what steps must be taken to ensure efficient optimization 
and parallelization of the computations. As is clear from the previous sections, the DFT++ 
formalism is firmly based on linear algebra. The objects appearing in the formalism are 
vectors and matrices. The computations performed on these objects are matrix addition 
and multiplication and the application of linear operators. An important benefit is that 
linear-algebraic products involve matrix-matrix multiplications (i.e. BLAS3 operations), 
which are precisely those operations that achieve the highest performance. 

We use an object-oriented approach and the C++ programming language to render 
the implementation and coding as easy as possible. In addition, object-oriented program- 
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ming introduces modularity and localization of computational kernels allowing for effective 
optimization and parallelization. In the sections below, we present outlines of our imple- 
mentation, optimization, and parallelization strategies. 

7.1 Object-oriented implementation in C-\ — |- 

In our work, we have found that an object-oriented language such as C++ is ideal for 
implementing the required vectors and matrices and for defining the operations on them in 
a manner that follows the DFT++ formalism as closely as possible. The object-oriented 
approach presents a number of advantages. 

First, the programming task becomes highly modular: we identify the object types 
needed and the operations that must be performed on them, and we create a separate module 
for each object that can be tested independently. For example, we define the class of matrices 
and the operations on them (e.g. multiplication, addition, diagonalization, etc.), and we can 
test and debug this matrix module separate from any other considerations. Second, we gain 
transparency: the internal structure or functioning of an object can be modified for improved 
performance without requiring any changes to higher-level functions that use the object. This 
gives us a key feature in that the high-level programmer creating new algorithms or testing 
new energy functionals does not need to know about or interact with the lower-level details 
of how the objects actually are represented or how they function. Third, this separation of 
high-level function from low-level implementation allows for a centralization and reduction 
in the number of computational kernels in the code: there can be a large variety of high-level 
objects for the convenience of the programmer, but as all the operations defined on them are 
similar linear-algebraic ones, only a few actual routines must be written. Fourth, modularity 
produces shorter and more legible code. This, combined with the object-oriented approach, 
implies that the high-level computer code will read the same as the equations of the DFT++ 
formalism so that debugging will amount to checking formulae, without any interference of 
cumbersome loops and indices. 

To give a concrete example of what this means, consider the simplest object in the 
formalism, a column vector such as the electron density n. In C++, we define an object 
class vector which will contain an array of complex numbers (itself a lower-level object). 
A vector v has a member v. size specifying the number of rows it contains as well as a 
pointer to the array containing the data. We define the action of the parenthesis so as to 
allow convenient access of the ith element of v via the construction v(i). 

A very common operation between two vectors v and w is the scalar product v^w. 
We implement this by defining (in C++ parlance overloading) an operator " that takes two 
vectors and returns a complex result. The code accomplishing this is 

friend complex operator" (vector &v, vector few) 
{ 

complex result = 0; 

for(int i=0; i < v. size; i++) 

result += conjugate (v(i))*w(i) ; 

return result ; 
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} 



An example of an operator acting on vectors is the inverse transform J . This operator 
can be coded as a function J that takes a vector v as its argument and returns the vector 
result J(v). Of course, the details of what J does internally are basis-dependent. 

Based on this definition, the evaluation of the electron-ion interaction energy of Sec- 
tion [4.3.2|, given by the expression E e _i = {JnyV ion , is coded by 



E_ei = J(n)~V_ion; 



where V_ion is the vector V ion of Eq. 

Following the same idea, we define a matrix class to handle the matrices such as U, 
W, H, etc. that occur in the formalism. Physically, expansion coefficients such as Y and 
C are collections of column vectors (a column of coefficients C a i for each wave function ipi), 
and we define the class column_bundle to handle these objects. Although mathematically 
column_bundles such as Y and C are matrices, the choice not to use the matrix class for 
representing them is deliberate, as Y and C have a distinct use and a special physical meaning 
that an arbitrary matrix will not have. In this way, we can tailor our objects to reflect the 
physical content of the information they contain. Of course, when a matrix multiplication 
is performed, such as when we evaluate the expression C = YU^ 1 ^ 2 , the column_bundle 
class and matrix class call a single, low-level, optimized multiplication routine. We thus 
gain flexibility and legibility of codes on the higher levels while avoiding an accumulation of 
specialized functions on the lower levels. 

As a concrete example of the power and ease of this style of programming, consider 
writing the code for the function calc-energy(Y) of Figure ^. In order to do so, we will 
need a few more operators. Following the example of the vectors, we define the ~ operator 
between two column_bundles to handle Hermitian-conjugated multiplications such as Y± Y-j, 
where the result of the product is of type matrix. Next, we define the * operator between a 
column_bundle and matrix to handle multiplications of the type YU' 1 ^ 2 , where the result 
is a column_bundle. We code the action of the basis-dependent operators such as O or X on 
a vector <p or a column_bundle C as function calls O(phi) or 1(C), which return the same 
data type as their argument. Finally, we implement multiplication by scalars in the obvious 
way. Figure [7] presents the C++ code for the energy calculation routine. The code is almost 
exactly the same as the corresponding expressions in the DFT++ formalism, making the 
translation from mathematical derivation to actual coding trivial. 



7.2 Scalings for dominant DFTH — \- operations 

Before we describe our approach to optimization and parallelization, we will work out the 
scalings of the floating-point operation counts as a function of system size for the various 
computational operations in the DFT++ formalism. Thus it will be clear which optimiza- 
tions and parallelizations will increase the overall performance most efficaciously. Let n be 
the number of wave functions {tpi} and let N be the number of basis functions {b a (r)} in the 
calculation. A vector contains N complex numbers, a matrix is nxn, and a column_bundle 
is N x n and is the largest data structure in the computation. Both n and N are proportional 
to the number of atoms n a in the simulation cell, or equivalently, to the volume of the cell. 
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c 


omplex calc_energy (column_bundle &Y, 


column_bundle &C, 




mauriX 06U , matrix 


jMTm"h 3 1 -F 




vector &n, vector 


&phi , 




double f , vector 


KV_ion > ) 


{ 








complex E_LDA; 






U = Y~0(Y) ; 






// calc_Umhalf (U) returns U~{-l/2} 


given U 




Umhalf = calc_Umhalf (U) ; 






C = Y*Umhalf ; 






// diag_of _self _product (X) returns diag(X*adjoint(X)) 




n = f *diag_of _self .product (I (C) ) ; 






phi = (4.0*PI)*invL(0bar(n)) ; 






E_LDA = -0.5*f*Tr(C~L(C)) + 






J(n)~( V_ion + 0(J(exc(n))) 


- 0.5*0bar(phi) ) ; 




return E_LDA; 




} 







Figure 7: LDA energy routine (C++ implementation) - C++ code for the calc-energy(y) 
function which was outlined in Figure When computing the electron density n, we have not 
used the straightforward code diag(X*adjoint (X) ) which first computes the entire matrix 
X*adjoint(X) before extracting its diagonal and is thus computationally wasteful. Rather, 
for performance reasons, we have written a separate function diag_of _self _product (X) 
that given a column_bundle X computes only the required diagonal portion of the product 
X*adjoint(X). 



Typically, for accurate DFT calculations, the number of basis functions required is much 
larger than the number of quantum states, N ^> n. 

In Table [l] we present the floating-point operation counts for the different operations 
required in the DFT++ formalism. We note that for very large systems, the basis-set 
independent matrix products dominate the overall computational workload. However, for 
medium-sized or slightly larger problems, the application of the basis-dependent operators 
can play an important role as well. 

For most basis sets, there are techniques available in the literature that allow for effi- 
cient application of the basis-dependent operators to a column vector in O(N) or O(NhaN) 
operations. For example, when working with plane waves, the Fast Fourier transform 
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Operation 


Examples 


FLOP count 


co lumn_buridl e 


M = YlY 2 or 


0{n 2 N) 


multiplications 


C = YU- 1 ' 2 




matrix multiplications 


U = WfiW^ or 


0(n 3 ) 


and diagonalizations 


[TV* = WfJl -l/2 W ] 




Applying basis-dependent 


OY, LC, or 


0{nN) or 


operators 


XC 


0(nN In N) 


Calculating n 






given XC or 
calculating the kinetic 


n = diag{(XC)F(XC)t} 
or Tt{FC\LC)} 


0(nN) 


energy given LC 






vector operations 


e xc (n), (JnyVion, or 


O(N) 



Table 1: Floating-point operation count for various common operations in the DFT++ 
formalism. The size of the basis set is N and the number of wave functions is n. Thus a 
column_bundle is N x n, a matrix is n x n, and a vector is N long. 



is the algorithm of choice for implementing the operators X, J , X', and and allows us 
to affect these operators in O(NlaN) operations. (See Appendix [A| for the details of a 
plane- wave implementation.) The operators L and O are already diagonal in a plane- wave 
basis and are thus trivial to implement. For real-space, grid-based computations, multigrid 
methods [ffB| are highly effective for inverting L and solving the Poisson equation. Special 
techniques exist for multiresolution |BL |5], and Gaussian (FT?] basis sets that allow for the 
application of the basis-dependent operators in 0(N) operations as well. 



7.3 Optimization of computational kernels 

Due to the modularity of our object-oriented approach, the physical nature of the problem 
under study is a high-level issue that is completely independent of the functioning of the 
few, low-level computational kernels which handle most of the calculations in the code. The 
aim now is to optimize these kernels to obtain maximum computational performance. By 
optimization we mean increasing performance on a single processor. Parallelization, by which 
we mean distribution of the computational task among several processors, will be addressed 
in the next section, but good parallel performance is only possible when each processor is 
working most effectively. 

The computationally intensive operations involved in the DFT++ formalism fall into 
two overall classes. First are the application of the basis-dependent operators such as L, O, 
X, etc., whose operation counts scale quadratically in the system size (see Table D). Given 
their basis-dependent nature, no general recipe can be given for their optimization. However, 
for many widely used basis sets, highly efficient methods exist in the literature which allow 
for the application of these operators, and we refer the reader to the references cited in the 
preceding section. For the case of plane waves, we have used the FFTW package to 
affect the Fourier transformations. This highly portable, freely obtainable software library 
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provides excellent per processor performance across many platforms. 

The second class of operations are the basis-independent matrix products, and we 
will now consider the optimization of these operations. As a case study, we examine the 
Hermitian-conjugated matrix product between two column_bundles, which occurs in an ex- 
pression such as Y1Y2 and which is coded using the column_bundle~column_bundle operator 
as Y_1~Y_2. The most "naive" and straightforward implementation of this operator in C++ 
is given by 

friend matrix operator" (column_bundle &Y_1, column_bundle &Y_2) 

{ 

// create a matrix of size Y_l .n_columns x Y_2 .n_columns 

// to hold the result 

matrix M(Y_1 .n_columns,Y_2.n_columns) ; 

int i,j,k; 

for(i=0; i < Y_l .n_columns ; i++) 

for(j=0; j < Y_2 .n_columns ; j++) { 
M(i,j) = 0; 

for(k=0; k < Y_l.n_rows; k++) 

M(i,j) += conjugate(Y_l(k,i))*Y_2(k,j); 

} 

return M; 

} 

While easy to follow, this implementation is quite inefficient on modern computer 
architectures for large matrix sizes because the algorithm does not take any advantage of 
caching. The access pattern to memory is not localized, and the processor must continually 
read and write data to the slower-speed main memory instead of to the much faster (but 
smaller) cache memory. 

One solution to this problem is to resort to vendor-specific linear algebra packages. 
For example, one can use a version of LAPACK optimized for the computational platform at 
hand, and this generally results in very good performance. An alternative choice is to perform 
the optimizations by hand. While this second choice may sacrifice some performance, it does 
ensure highly portable code, and this is the strategy that we have followed in our work. 

Our basic approach to increasing performance is to use blocking: we partition each 
of the input and output matrices into blocks of relatively small dimensions, and the matrix 
multiplication is rewritten as a set of products and sums over these smaller blocks. Provided 
that the block sizes are small enough, say 32x32 or 64x64 for todays' typical processor and 
memory architectures, both the input and output blocks will reside in the high-speed cache 
memory and fast read/write access to the cache will dramatically improve performance. 
Figure [8] shows a schematic diagram of how the product M = Y^Y 2 would be carried out in 
a blocked manner. 

Please note that due to blocking, the task of optimization is now also modularized. 
We need only write a single, low-level, block-block matrix multiplication routine that should 
be highly optimized. All higher-level matrix multiplication functions will then loop over 
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Figure 8: Blocked matrix multiplication - A schematic of the blocked matrix multiplication 
method applied to computing the product M = YlY^. The blocks m^-, a^-, and bij are 
matrices of small size (e.g. 32 x 32). Our schematic shows how each of the matrices is 
partitioned into blocks and how the result blocks are to be computed. 



blocks of the input and output data and copy the contents to small, in-cache matrices which 
are then multiplied by calling the low-level multiplier. 

Applying these ideas to our M = Y1Y2 example, the rewritten code for the 
column_bundle~column_bundle operator takes the following form: 

friend matrix operator" (column_bundle &Y_1, column_bundle &Y_2) 
{ 

matrix M(Y_1 .n_columns,Y_2.n_columns) ; // M=Y_1~Y_2 holds the result 



int B = 32; // Pick a reasonable block size 



matrix inl(B,B) ,in2(B,B) ,out(B,B) ; // input and output blocks 



int ib, jb,kb,i, j ,k; 



// loop over blocks of the output 
for(ib=0; ib < Y_l .n_columns ; ib+=B) 

for(jb=0; jb < Y_2 .n_columns ; jb+=B) { 



// zero the output block 
for(i=0; i < B; i++) 
for(j=0; j < B; j++) 
out(i,j) = 0; 



// loop over blocks to be multiplied 
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for(kb=0; kb < Y.l.nrows; kb+=B) { 



// load data into input blocks 
for(i=0; i < B; i++) 

for(k=0; k < B; k++) { 

inl(i,k) = conjugate (Y_l (kb+k, ib+i) ) ; 

in2(i,k) = Y_2(kb+k, jb+i) ; 

} 

// perform the block multiplication 
low_level_mutliply (inl , in2, out, B) ; 

} 

// write the result to memory 
for(i=0; i < B; i++) 
for(j=0; j < B; j++) 

M(ib+i, jb+j) = out(i, j) ; 

} 



return M; 

} 

The function low_level_multiply performs the block-block multiplication of the in- 
put blocks and accumulates into the output block. Not only the column_bundle~column_bundle 
operator but all matrix multiplications can be blocked in a similar way and will thus call 
the low-level multiplier. 

The simplest implementation for low_level_multiply is the straightforward one: 

void low_level_multiply (matrix &inl, matrix &in2, 

matrix &out, int B) 

{ 

int i,j,k; 

for(i=0; i < B; i++) 

for(j=0; j < B; j++) { 
complex z = 0; 
for(k=0; k < B; k++) 

z += inl(i,k)*in2(j ,k) ; 
out(i,j) += z; 

} 

} 

The use of this simple low-level multiplier combined with blocking can provide a 
tremendous improvement. Figure [| shows a plot of the performance of the M = YiY 2 
blocked-multiplication as a function of the block size when run on a 333 MHz SUN Ultrasparc 
5 microprocessor. For comparison, the performance of the "naive" code with no blocking, 
which was presented above, is also indicated in the figure. Initially, the performance increases 
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Figure 9: Matrix-multiplication FLOP rates (single processor) - Effect of blocking on com- 
putational performance for the matrix product M = Y^Yz, where Y\ and Y 2 are 10,000x200 
complex-valued matrices. The horizontal axis shows the size of the square blocks used for 
the block- multiplication scheme described in the text. The vertical axis is the performance 
in mega floating-point operations per second (MFLOPS). The horizontal dashed line shows 
the rate for a non-blocked "naive" multiplication routine (see text). 



dramatically with increasing block size due to more effective caching. However, there is an 
optimal size above which performance decreases because the blocks become too large to fit 
effectively into the cache. On most computational platforms that we have had experience 
with, this simple blocked multiplier runs at at least half the peak theoretical rate of the 
processor, as exemplified by the results in the figure. Further speedups can be found by 
rewriting the low_level_multiply routine so as to use register variables (as we have done 
and found up to 40% performance enhancements) or by using a hierarchical access pattern 
to memory which can provide better performance if the memory system has multiple levels 
of caching. 

7.4 Parallelization 

Once the computer code has been optimized to perform well on a single processor, the 
computation can be divided among multiple processors. We now discuss how this division 
can be achieved effectively. 

The architectures of modern parallel supercomputers can generally be divided into 
two categories: shared-memory (SMP) versus distributed-memory (DMP) processors. In the 
SMP case, a set of identical processors share access to a very large repository of memory. The 
main advantages of shared memory are that the processors can access whatever data they 
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need, and that, with judicious choice of algorithms, very little inter-processor communication 
is required. In addition, only small changes are required to parallelize a serial code: the 
computational task is divided among the processors, and each processor executes the original 
serial code on the portion of the data allotted to it. However, as the number of processors 
increases, the traffic for accessing the main memory banks increases dramatically and the 
performance will stop to scale well with the number of processors utilized. Nevertheless, 
many mid-sized problems are well suited to SMPs and can take full advantage of the key 
features of SMP systems. Examples of such calculations can be found in |23, 24]. 



The largest of today's supercomputers have distributed memory: each processor has 
a private memory bank of moderate size to which it has exclusive access, and the processors 
communicate with each other by some message passing mechanism. The main advantages 
of DMP are scalability and heterogeneity, as the processors need not all be identical nor 
located in close physical proximity. However, the price paid is the necessity of an inter- 
processor communication mechanism and protocol. In addition, computer algorithms may 
have to be redesigned in order to minimize the required communication. Furthermore, a slow 
communication network between processors can adversely affect the overall performance. 

We will describe, in outline, the strategies we employ for both SMP and DMP archi- 
tectures, and we will continue to examine the case of the 

column_bundle"column_bundle matrix multiplication as a specific example. As our results 
for actual calculations will demonstrate, we only need to parallelize two main operations 
in the entire code, (a) the application of basis-dependent operators such as X or L to 
column_bundles (which is trivial) and (b) the matrix products involving column_bundles 
such as Y±Y 2 or YU^ 1 ^ 2 , to obtain excellent or highly satisfactory performance on SMP and 
DMP systems. 



7.4.1 Shared-memory (SMP) architectures 

Since all processors in an SMP computer have access to all the data in the computation, 
the parallelization of the basis-dependent operators is trivial. For example, the operation 
IC involves the application of X to each column of C separately, and the columns can be 
divided equally among the processors. This leads to near perfect parallelization as none of 
the processors read from or write to the same memory locations. 

Based on the discussion of Section [7.2| , for large problems, the most significant gains 
for parallelization involve the basis-independent matrix-products. Below, we focus on the 
c o lumn_bundl e ~ co lumn_bundl e operation study. For this operation, it is impossible 

to distribute the task so that all processors always work on different memory segments. 
However, we divide up the work so that no two processors ever write to the same memory 
location: not only does this choice avoid possible errors due to the synchronizations of 
simultaneous memory writes, it also avoids performance degradation due to cache-flushing 
when memory is written to. 

Consider the matrix product M = Y±Y 2 , which we have implemented as a blocked 
matrix multiplication. Parallelization is achieved simply by assigning each processor to 
compute a subset of the output blocks. The main program spawns a set of subordinate tasks 
(termed threads) whose sole purpose is to compute their assigned output blocks and to write 
the results to memory. The main program waits for all threads to terminate before continuing 
onwards. Referring to Figure ||, this strategy corresponds to distributing the computation of 
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the blocks m^- among the processors, and since memory is shared, all processors have access 
to all of the data describing Yi and Y 2 at all times. For large problem sizes, the overhead in 
spawning and terminating the threads is far smaller than the work needed to compute the 
results, so the simplicity of this strategy does not sacrifice performance. 

We now present the code which accomplishes this parallelized matrix product in order 
to highlight the ease with which such parallelizations can be performed. In the interest of 
brevity, we assume that the number of columns of Yi is a multiple of the number of threads 
launched. Parallelization is affected by assigning different threads to compute different rows 
of the result M = Y?Y 2 . 

friend matrix operator" (column_bundle &Y_1, column_bundle &Y_2) 
{ 

matrix M(Y_1 .n_columns , Y_2 .n_columns) ; // M=Y_1~Y_2 holds the result 

int n_threads =8; //a reasonable number of threads 

int thread_id[n_threads] ; 
int i, start, n; 

for (i=0; i < n_threads; i++) { 

// The set of rows of M that this thread should compute 
n = Y_l .n_columns/n_threads; 
start = i*n; 

// Launch a thread that runs the routine calc_rows_of _M 
// and pass it the remaining arguments. Store the thread ID. 
thread_id[i] = launch_thread(calc_rows_of _M, Y_l, Y_2, M, 

start , n) ; 

} 

// Wait for the threads to terminate 
for (i=0; i < n_threads; i++) 

wait_f or_thread(thread_id[i] ) ; 

return M; 

} 

The routine calc_rows_of_M(Y_l,Y_2,M, start, n) computes rows start through 
start +n-l of M, where M=Y_1~Y_2. The routine's algorithm is identical to that of the blocked 
multiplier presented in the previous section. The only change required is to have the ib loop 
index start at start and end at start+n-1. 

We parallelize other matrix multiplications involving column_bundles analogously. In 
addition, we parallelize the application of the basis-dependent operators as discussed above. 
For a realistic study of performance and scaling, we consider a system of bulk silicon with 
128 atoms in the unit cell. We use a plane-wave cutoff of 6 Hartrees (12 Ryd) which leads 
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Figure 10: Scaling for SMP parallelization - Performance of our SMP parallelized plane- 
wave code for a 128 atom silicon system. We show the performance of the parallelized 
M = Y±Y 2 multiplications (circles) and the overall code (pluses) as a function of the number 
of processors . In both cases, performance has been normalized to the respective performance 
with a single processor. The straight line with slope of unity represents ideal scaling for 
perfect parallelization. 



to a basis of size N = 11797. We use Kleinmann-Bylander [2(J non-local pseudopotentials 
with p and d non-local projectors to eliminate the core states, and we have n = 256 bands of 
valence electrons. We sample the Brillouin zone at k — 0. In Figure 10, we show a plot of the 
performance of the parallelized M = Y^Y 2 multiplication as well as the overall performance 
of the code for a single conjugate-gradient step as a function of the number of processors 
employed. The calculation was run on a Sun Ultra HPC 5000 with eight 167 MHz Ultrasparc 
4 microprocessors and 512MB of memory. 

As the figure shows, the parallelized M = Y\Yi matrix multiplication shows near 
ideal scaling. There are a number of reasons for this behavior: (1) since different processors 
always write to different segments of memory, the algorithm does not suffer from memory 
write-contentions, (2) the inputs Y\ and Y 2 are never modified so that memory reads are 
cached effectively, and (3) the problem size is large enough so that each processor's workload 
is much larger than the overhead required to distribute the work among the processors. 
This scaling is all the more impressive because when using eight processors, each processor 
performs the multiplications at 140 MFLOPS or at 83% of the processor clock rate. 

The figure also displays the total performance of the code, which shows evidence of 
saturation. To understand this behavior in more detail, we model the overall execution time 
in accordance with Amdahl's law. We assume that there exists an intrinsic serial computation 
time Tq which must be spent regardless of the number of processors available. In addition, 
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Figure 11: Amdahl's analysis of SMP scaling - Total execution time of the SMP parallelized 
plane-wave code for a 128 silicon atom system as a function of the reciprocal of the number 
of processors used. The line of least-squares fit to the data points is also shown. Execution 
times are normalized in units of the execution time for a single processor. 



there is an analogous parallel computation time T which, however, is divided equally among 
all the processors. Thus the total execution time will be given by T = T + T/p, where p is 
the number of processors. We show a plot of the total execution time versus 1/p in Figure 
|TT| , and the model shows an excellent fit to the available data. The vertical asymptote of 
the least-squares fit straight line is the extrapolated value of To, in this case approximately 
10% of the single processor or serial execution time. Thus the few operations that we have 
chosen to parallelize do in fact constitute the largest share of the computational burden and 
our parallelization strategy is quite effective. 

When we reach the data point with eight processors, the total execution time is 
already within a factor of two of T , so that the total serial and parallel components have 
become comparable. Indeed, timing various portions of the code confirms this hypothesis: 
for example, with eight processors, the time needed to perform a parallel multiplication 
M = YiY 2 is equal to the time needed by the Sun high-performance LAPACK library to 
diagonalize the overlap matrix U (cf. Eq. fl3lD ). With eight processors, the code has each 
processor sustaining an average of 134 MFLOPS or 80% of the processor clock rate. We 
are quite satisfied with this level of performance, but if more improvements are desired, the 
remaining serial portions of the code can be further optimized and parallelized. 



43 



Y 



Y 1 



N 



yoo 


Voi 


yo2 


yio 


yu 


yn 


2/20 


y2i 


2/22 




n 





! 




1 


T 

yoo 


T 

2/io 


T 
2/20 


T 

2/oi 


T 

2/ii 


T 
2/21 


T 

y 2 


T 
2/12 


T 
2/22 








N 



n 



Figure 12: Transposition of distributed matrices - Schematic diagram showing how the 
transpose Y T of the distributed column_bundle Y is obtained in a DMP calculation, which 
in this example has p = 3 processors. Across all the processors, Y is N x n and F T is 
n x N. Solid vertical lines show the distribution of the columns of Y or Y T among the 
processors, so that each processor stores aiVx (n/p) block of Y and a n x (N/p) block of 
y T . The horizontal dotted lines show the division of Y and Y T into blocks that must be 
communicated between processors to achieve the transposition: the block is sent from 
processor j to processor %. Processor % then transposes the block and stores it in the jith 
section of Y T . 



7.4.2 Distributed-memory (DMP) architectures 

Parallelization on DMP architectures requires careful thought regarding how the memory 
distribution and inter-process communications are to be implemented. The most memory- 
consuming computational objects are the column_bundles, and for a large system, no single 
processor in a DMP computer can store the entire data structure in its local memory banks. 
Therefore, we parallelize the storage of column_bundles by distributing equal numbers of 
the columns comprising a column_bundle among the processors. 

Given this distribution by columns, the application of basis-dependent operators is 
unchanged from how it is done in a serial context: each processor applies the operator to the 
columns that are assigned to it, and perfect parallelization is achieved as no inter-processor 
communication is required. The large basis-independent matrix products, however, are more 
challenging to parallelize as they necessarily involve inter-processor communication. 

Consider again the product M = Y±Y 2 , which in terms of components is given by 

Mij = E(^(^- • 

k 

The column-wise parallel distribution of Yi and Y 2 means that the full range of the % and j 
indices is distributed among the processors while the full range of the k index is accessible 
locally by each processor. Since Y x and Y 2 are N x n, each processor has aiVx (n/p) block 
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Figure 13: Scaling of DMP parallelization - Performance of the DMP parallelized plane- 
wave code for a 256 silicon atom system as a function of the number of processors for the 
parallelized M = Y^Yi multiplication (circles) and for the code overall (pluses). In both 
cases, the performance has been normalized to the respective performance with a single 
processor based on extrapolation from the eight processor run. The straight line with slope 
of unity represents ideal scaling for perfect parallelization. 



(i.e. n/p columns of length N) in its local memory, where p is the number of processors. 
Unfortunately, computing M using this column-wise distribution requires a great deal of 
inter-processor communication. For example, the processor computing the ith row of M 
requires knowledge of all the columns of Y%, so that in total, the Nn data items describing 
Y% will have to be sent p — 1 times between all the processors. 

A more efficient communication pattern can be developed that avoids this redundancy. 
Denoting the transpose of Y by Y T , the matrix product can be rewritten as 

M ij = J2(Y 1 T )* k (Y 2 T ) jk . 

k 

Hence, if we first transpose Y\ and Y2, then the column- wise distribution of the transposed 
column_bundl.es insures that the full range of the i and j indices can be accessed locally on 
each processor while the full range of the k index is now distributed among the processors. 
Figure ^ presents a schematic of how the transposition of a column_bundle Y is be accom- 
plished: each processor has a N x (n/p) block of Y whose contents it sends to p — 1 other 
processors as p — 1 blocks of size (N/p) x (n/p). Next, each processor receives p — 1 similar 
blocks sent to it by other processors, transposes them, and stores them in the appropriate 
sections of Y T . Each processor sends or receives only Nn(p — l)/p 2 data items, and a total 
of Nn(p — l)/p data items are communicated among all the processors. 
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Figure 14: Amdahl's analysis of DMP scaling - Total execution time of the DMP parallelized 
plane wave code for a 256 silicon atom system as a function of the reciprocal of the number of 
processors utilized. The line of least-squares fit to the data points is shown as well. Execution 
times are normalized in units of the execution time for a single processor as extrapolated 
based on the eight processor run. 



The computation of M in the transposed mode has the same operation count as in 
the non-transpose mode (i.e. 0(n 2 N) operations), which when distributed across processors, 
amounts to 0(n 2 N/p) operations per processor. Of course, we block the matrix multipli- 
cations on each processor to ensure the best performance. Finally, a global sum across all 
the processors' n x n resulting matrices is required to obtain the final answer M, and this 
requires log 2 p communications of size n 2 when using a binary tree. 

Assuming that processors can simultaneously send and receive data across the net- 
work, the time required to perform the transpose is 0(nN/p), the time needed to perform 
the multiplications is 0(n 2 N/p), and the time required to perform the final global sum is 
0(n 2 log 2 p). For large problem sizes, the time needed to perform the multiplications will 
always be larger than the time required for the communications by a factor of ~ n. Thus 
interprocessor communications are, in the end, never an issue for a sufficiently large physical 
system, and the computation will be perfectly parallelized in the asymptotic limit of an 
infinitely large system. 

Figure 13 shows a plot of the performance of our DMP parallelized plane-wave code 
for a single conjugate-gradient step when run on a 256 atom silicon cell. The choice of 
pseudopotential, cutoff, and fc-point sampling are the same as for the SMP calculation above. 
The basis set is of size N = 23563 and the system has n = 512 valence bands. The 
calculations were run on an IBM SP2 system with 336 nodes, and each node has four 332 
MHz Power2 Architecture RISC System/6000 processors and 1.536 GB of memory. Again, we 
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see excellent and near perfect scaling for the parallelized matrix multiplications, validating 
our claim that the transposition approach combined with the large system size provides 
for very good parallelization. With 32 processors, each parallelized multiplication runs at 
an average rate of 188 MFLOPS per processor (57% of the processor clock rate), which is 
impressive given the fact that the processors are busily communicating the data required for 
the transpositions and global sums. 

The plot also shows the saturation of the total performance of the code with increas- 
ing number of processors. With eight processors, the overall performance translates into an 
average rate of 254 MFLOPS per processor (76% of the clock rate), whereas with 32 proces- 
sors the rate has reduced to 160 MFLOPS per processor (48% of the clock rate). Clearly, 
the serial portions of the calculation begin to contribute significantly to the run time of the 
code. Following the discussion of the SMP results above, Figure [H| presents a plot of the 
total execution time versus inverse number of processors. The extrapolated serial time T 
in this case is only 2-3% of the total theoretical run time on a single processor, which shows 
how effectively the calculation has been parallelized. In addition, for the 32 processor run, 
the total run time is only two times larger that T , signalling the end of significant gains 
from the use of more processors. 
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A Plane-wave implementation of the basis-dependent 
operators 

A widely used basis-set for ab initio calculation has been the plane- wave basis set |7| . Plane 
waves are ideally suited for periodic calculations that model the bulk of a crystalline material. 
In addition, plane waves provide uniform spatial resolution throughout the entire simulation 
cell, and the results of the calculations can be converged easily by simply increasing the 
number of plane waves in the basis set. We will use plane waves as a concrete example of 
how the basis-dependent operators of Section [O] are to be implemented. 

Given the lattice vectors of a periodic supercell, we compute the reciprocal lattice 
vectors and denote the points of the reciprocal lattice by the vectors {G}. Each element of 
our basis set {b a (r)} will be a plane wave with vector G a , 




where f2 is the real-space volume of the periodic cell. This basis is orthonormal, and the 
overlap operator O is the identity operator, 

= 1. 



The integrals of the basis functions s are given by 

s a = VfL5 Gafi , 

so that the O operator is given by 

O a j3 = 0~a tl3 — S Ga ^ ■ 5 G[}) q . 

The Laplacian operator L is diagonal in this basis and is given by 

L a /3 — — 1| G a \\ 2 5 a p . 

The forward transform X is given by a Fourier transformation. Specifically, for a point 
p on the real-space grid, we have that 



X 



'n 



Consider applying X to the column vector x an d evaluating the result at the point p: 

a V" a 



This is the forward Fourier transform of X- 

For the case of plane waves, the inverse transform J can be chosen to be the inverse 
of X, J = X^ 1 , as per the discussion of Section fO. It follows that 
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where N is the number of points in the real-space grid. Applying J to a column vector £, 
we have 

i V p 

Thus ^ is a reverse Fourier transform. The operators X" and J7"^ are also Fourier transforms 
with appropriate scaling factors. Computationally, Fast Fourier Transforms |2T| can be 
used to implement these operators most efficiently 

The last remaining basis-dependent item is the ionic potential V^ on . For periodic 
systems, this potential is a periodic sum of atomic potentials V at (r), 

V lon (r)=J2V at (r-R-r I ), 

R,I 

where R ranges over the real-space lattice sites, / ranges over the atoms in the unit cell, and 
77 is the position of the 7th atom. Based on this, the elements of the vector Vi on of Eq. ( [Op 
are given by 

/-rr \ S(Ga)V a t(Ga) 
\ vion )a ' 

where the structure factor S(q) is given by 

S(q) = £ e^' , 

and the Fourier transform of the atomic potential V at is defined by 

V at (q) = Jd 3 r e-^Vatir). 



B Non-local potentials 

In this section we show how non-local potentials can easily be incorporated into the DFT++ 
formalism. The total non-local potential operator V for a system is given by a sum over 
each atom's potential, 

where V} is the non-local potential of the Jth atom. The non-local energy is given by the 
expectation of V over all the occupied states {ipi} with fillings fi, 

E nl = Y,fi(^\v\A) = J2T,fi(^\Vi\^)- 

i I i 

Given the linearity of E n \ with respect to the atoms /, in our discussion below we will only 
consider the case of a single atom and will drop the index /. The results below can then be 
summed over the atoms to provide general expressions for multiple atoms. 

Using the expansion coefficients C a i of Eq. (^), we rewrite the non-local energy as 

E nl = J2M^\V\A) = £ hClMV^Cpi = Tr(UCFCt) = Ti(VP) , (45) 
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where |a) is the ket representing the basis function b a (r), P is the single-particle density 



matrix of Eq. (|20|) , the matrix F was defined as F^ 
non-local potential are defined as 



Sijfi, and the matrix elements of the 



V aP = (a\V\P) = d 3 r d 3 r' b* a (r)V(r,r')bp(r') . 



The matrix V clearly depends on both the basis set and the potential. The contribution 
of the non-local potential to the total Lagrangian of Eq. (^) is given simply by Tr(VP). 
Following the derivations of Eqs. (^) and fl5S|), we see that the single-particle Hamiltonian 
H is modified only by the addition of V, 



H 



Jtpiagigj + V. 
We now write the potentials in separable form, 



2 



V = J2\s}M ss/ ( S '\ 



(46) 



(47) 



where s and s' range over the quantum states of the atom, M ss > are matrix elements specifying 
the details of the potential, and \s) is the ket describing the contribution of the sth quantum 
state to the potential. Typical choices of s are the traditional atomic state labels nlm and 
possibly the spin a. Once we define the matrix elements K as = (a\s), which are again 
basis-dependent, we find two equivalent forms for E n i, 



E, 



nl 



E 

i,s,s' 



f t (^\s)M ss ,(s'\^)= J2 



fiC* ai K as 



i,s,s',a,0 



Tr \M (K^C)F(K^Cy] = Tr [KMK^P 



(48) 



The first form involving K*C is most useful for efficient computation of the energy, and the 
second form is most useful for derivation of the gradient (cf. the discussion of Eqs. fl28f) and 
(p9|)). The energetic contribution to the Lagrangian is given by Eq. ([48"D and the contribution 
to the Hamiltonian H is KMK^ , which replaces V in Eqs. ( ^5|) and (|46|). 

A further specialization involves the popular case of Kleinmann-Bylander potentials 
20| where the double sum in Eq. (|47| ) is reduced to a single sum over s. Thus, the matrix 



M is diagonal with elements m s . The expression for the total non-local energy, this time 
including the sum over atoms /, is 



= Y,™>is{K\ s C)F{K\ s Cy = Tr 



l.s 



J2m Is K Is Kl)P 



Unfortunately, this expression is not very efficient for evaluating the energy of a system with 
many atoms, as the sum on I is large but the matrix K\ S C only has a single row. This limits 
our ability to exploit the cache effectively (which only occurs for large matrix sizes). 

We can rewrite the above energy expression so as to employ larger matrices and thus 
achieve greater computational efficiency To do this, we define a diagonal matrix M s that 
contains the rrij s values for all the atoms, (M s )jj = Sjjmj s , and we define the matrices A s 
via {A s ) a j = Ki s a . We then reorganize the previous expression for the non-local energy, 



E nl = J2^\M s (AtC)F(Alcy 



Tr 



J2A s M s At)p 
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If we have N basis functions, n quantum states {ipi}, and n a atoms in the system, then C is 
N xn and A s is N xn a . Thus, for large system sizes, the products A\C involve matrices with 
large dimensions, and optimized matrix-multiplication routines function at peak efficiency. 

C Multiple /c-points 

We consider the generalization of our formalism to the case of multiple fc-points, which arises 
in the study of periodic systems. In periodic cells, the wave functions satisfy Bloch's theorem 
and can be labeled by a quantum number k, a vector in the first Brillouin zone. The quantum 
states obey the Bloch condition 

Mr + R) = e lhR Mr), 

where R is a lattice vector of the periodic cell. This implies that ipk(r) = e tk r Uk{r) where 
Uk is a periodic function of r, Uk{r + R) = Uk(r). We define the expansion coefficients C\~ for 
the vector k as being those of the periodic function Uk(r) and arrive at (cf. Eq. (|7|)) 

iJk m (r) = e lk - r Y,(Ck)* m b a (r), (49) 

a 

where the integer m labels the energy bands (i.e. different states at the same value of k). 
The Fermi-Dirac fillings may also have a fc-dependence and are denoted as f\~ m . 

In addition to /c-vectors, calculations in periodic systems attach a weight to the 
wave vector k. The rationale is that we require the integrals of physical functions over the 
Brillouin zone in order to compute the Lagrangian, energies, and other quantities. Ideally, we 
would like to integrate a function g(k) over the Brillouin zone, but in a practical computation 
this must be replaced by a discrete sum over a finite number of fc-points with weights w^. 
That is, we perform the following replacement 

f Sk g{k) ^Y, w k9{k) ■ 
J k 

The required generalizations of the DFT++ formalism are straightforward and are 
outlined below. The density matrices (cf. Eq. (|20|)) now depend on Appoints 

Pk = WkCkFkC\ , 

where the filling matrix is (Fk) mm i = S m ^ m 'fkm, and the expansion coefficient matrices Ck 
are given by Eq. d49|) . We define the total density matrix P through 

k 

The electron density n (cf. Eq. (pT|)) is given by 

n = ^diag(jP fc J t ) . 

k 

The electron-ion, exchange- correlation, and electron-Hartree energies depend only on n, and 
provided the above /c-dependent expression for n is used, these contributions require no 
further modification from the forms already given in Eqs. (p3|), ([24]), and (|27|) respectively. 
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The only change required to the basis-dependent operators involves the use of the 
Laplacian for computing the kinetic energy. The proper generalization is to define k- 
dependent Laplacian matrices L k through 



(L 



k)af3 



d 3 r 



Ak-r 



b a (r) 



Ak-r 



This immediately leads to the following expression for the kinetic energy: 

z k 

We still require the operator L as defined in Eq. flj^) for operations involving the Hartree 
field (f) and the Poisson equation. The L operator is L k evaluated at k — 0. 

The generalization of non-local potentials (Appendix ||) to multiple fc-points is also 
straightforward. The energy expression of Eq. fl45|) generalizes to 



E nl = J2^(VP k ). 

k 

Having completed the specification of the Lagrangian with multiple fc-points, the gen- 
eralizations required for the orthonormality condition and the expressions for the derivatives 
of the Lagrangian follow immediately. We introduce overlap matrices U k and unconstrained 
variables Y k (cf. Eqs. (0) and ©), 



U, 



Y k J OY k and C k 



YkU k 



-1/2 



where for simplicity we have set all subspace-rotation matrices to identity, V k 
differential of the Lagrangian takes the form (cf. Eq. (j35"[)) 



/. The 



dC LDA = J2^(H k dP k ). 



The single-particle Hamiltonians H k depend on k only through the kinetic operators L k , 

H k = -h k +I^[DmgV sp ]I + V. 

The expression for the single-particle potential V sp is unmodified from that of Eq. ( PB[ ) as it 
only depends on the total electron density n. The term V is to be added only if non-local 
potentials are employed (see Appendix |B|). 

The expressions of Eq. ([37]) for the derivative of the Lagrangian also generalize in the 
following straightforward way, 



dC-LDA 

' dC LDA \ 
, dYl ) 
H k 



2 Re J2 Tr 



dY l 



'dC 



LDA 



dYl )j 



where 



= w k {(l-OC k Cl)H k C k F k U^ 1/2 + OC k Q k ([H k ,F k })}, and 
= C\H k C k , 



where Q k is the natural generalization of the Q operator which uses the eigenvalues and 
eigenvectors of U k (Appendix 0). 
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D Complete LDA code with /c-points and non-local po- 
tentials 



In this section, we summarize and gather together the expressions for the LDA Lagrangian 
and its derivatives in the DFT++ formalism for a system with /c-points and non-local po- 
tentials. This type of system provides the natural starting point for studying bulk systems 
and the properties of defects in bulk-like systems f7|. 

As we have emphasized previously, it is sufficient for us to display the formulae for 
the Lagrangian and its derivatives because formulae in the DFT++ language specify all 
the operations that must be performed and translate directly into computer code. (See 
Section |7|.) Given the Lagrangian and its derivatives, we can use a variety of methods to 
achieve self-consistency. (See Section ^.) 

We follow the notation of Appendix and refer the reader to it for relevant details 
and definitions. The point we wish to emphasize is the compactness of the formalism and 
how it allows us to specify an entire quantum-mechanical Lagrangian or energy function in 
a few lines of algebra which explicitly show the operations required for the computation. We 
specialize to the case of Kleinmann-Bylander [E(J non-local potentials (Appendix IBI). 



C 



LDA 



-- ™k Tr (F k ClL k C k ) + (Jn) ] [V ion + OJe xc (n) - 0(f) 
1 k 

+ 5> fc £ Tr (M s (AlC k )F k (AlC k y) + ^L<f>, 



dC 



LDA 



{ (/ - OC k C\) H k C k F k U^ l/2 + OC k Q k ([CtH k C k , F k 



dC 



LDA 



<90t 



I 07T 



H k 



sp 



--L k + Jt [Diag V sp ] Z + £ A S M S A\ , 

^ s 

J f V ion + J f OJe xc (n) + [Diag e' xc {n)]J^OJn - J^0<\> . 



E The Q operator 

In this appendix, we define the Q operator which appears in expressions for the derivative 
of the Lagrangian, e.g. in Eq. Q3"7|). The formal properties satisfied by Q are also presented, 
properties used in the derivation of the expression for the derivative based on the connection 
between Q and the differential of the matrix U 1 ^ 2 . (See the derivation starting from Eq. ( |36|) 
and resulting in Eq. (0) in Section |4.5.1|.) 



We start with the Hermitian matrix U. Let \x be a diagonal matrix with the eigen- 
values of U on its diagonal, and let W be the unitary matrix of eigenvectors of U. Thus, the 
following relations hold: U = WpW\ W^W = WW^ = J, and UW = Wfi. 
Consider the differential of the matrix U. The Leibniz rule results in 



dU = dWfi W ] + W dfi + Wfi dW ] . 



53 



Using the unitarity of W, we have 



W*dUW = WUWfx + n dW ] W + dp . 

Differentiating the relation W^W = I, we have that dW^W + W^dW = or dW^W = 
—W^dW. Substituting this above, we arrive at the relation 

W j dUW = [WUW, A+dfi. (50) 

This equation describes how differentials of the eigenvalues and eigenvectors of U are related 
to the differential of U, and it is simply a convenient matrix-based expression of the results 
of first-order perturbation theory familiar from elementary quantum mechanics. To see this 
equivalence, we first examine the diagonal elements of Eq. (|50|) and find 

dfi n = (WUUW) nn , (51) 

the familiar expression for the first order shift of the eigenvalue /i n . Considering off diagonal 
matrix elements of Eq. fl50"D leads to 



(WUW) nm = ( WUUW )nm ioln _^ m ^ (52) 
fl>m fJ>n 

which is the expression for the first order shift of the mth wave function projected on the 
nth unperturbed wave function. 

Next, we consider f(U), an arbitrary analytic function of U. Using the eigenbasis of 
U, we can write f(U) = Wf([i)W> where by f(fi) we mean the diagonal matrix obtained 
by applying / to each diagonal entry of u separately. Following the same logic as above, the 
differential of f(U) satisfies 

WU[f(U)]W = [WUW, f(ji)] + f'(ji)dfi . 



Computing matrix elements of the above equation and using Eqs. fl5"ID and (52), we arrive 
at the general result 

f f r {uL 1 if Tfl = Tl 

(WU[f(U)]W) nm = (WUUW) nm -{ U (L)-?L) ] iim ^ n • (53) 



We now apply this result to the case where f(U) = U 1 ^ 2 . This means that f(fi n ) — 
^fJTn~ and that /'(u n ) = 1/(2*//!^) in Eq. fl53l) . By employing the algebraic identity (\/x — 
■Jy)/(x — y) = 1/ (y/x + y/y), we arrive at the expression 

(WU[U^]W) nm = ggg^b = (W^Q(dU)W) nm , (54) 

where we define the operator Q(A) for an arbitrary matrix A to be 

(W^Q(A)W) nm = W A ™)™_ . (55) 
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From this definition of Q, it is easy to prove that the following identities are satisfied for 
arbitrary matrices A and B and arbitrary power p: 

Q(dU) = d[U 1/2 } 

Tr (Q(A)B) = Tr (AQ(B)) 

Tr (Q(U P A)B) = Tr (U P Q(A)B) 

Tt{Q{AU p )B) = Tt{Q{A)U p B) 

A = Q(A)U 1/2 + U 1/2 Q(A). (56) 

These are the identities used in the derivation of the expression for the derivative of the 



Lagrangian with respect to Y in Section 4.5.1. 
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